Detecting mutations and ploidy in chromosomal segments

ABSTRACT

The invention provides methods, systems, and computer readable medium for detecting ploidy of chromosome segments or entire chromosomes, for detecting single nucleotide variants and for detecting both ploidy of chromosome segments and single nucleotide variants. In some aspects, the invention provides methods, systems, and computer readable medium for detecting cancer or a chromosomal abnormality in a gestating fetus.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. Utility application Ser. No. 17/959,543, filed Oct. 4, 2022. U.S. Utility application Ser. No. 17/959,543 is a continuation of U.S. Utility application Ser. No. 17/738,354, filed May 6, 2022. U.S. Utility application Ser. No. 17/738,354 is a continuation of U.S. Utility application Ser. No. 17/692,469, filed Mar. 11, 2022. U.S. Utility application Ser. No. 17/692,469 is a continuation of U.S. Utility application Ser. No. 15/898,145, filed Feb. 15, 2018 (now U.S. Pat. No. 11,319,595). U.S. Utility application Ser. No. 15/898,145 is a continuation of U.S. Utility application Ser. No. 14/692,703, filed Apr. 21, 2015 (now U.S. Pat. No. 10,179,937), which claims the benefit of and priority to U.S. Provisional Application Ser. No. 61/982,245, filed Apr. 21, 2014; U.S. Provisional Application Ser. No. 61/987,407, filed May 1, 2014; U.S. Provisional Application Ser. No. 62/066,514, filed Oct. 21, 2014; U.S. Provisional Application Ser. No. 62/146,188, filed Apr. 10, 2015; U.S. Provisional Application Ser. No. 62/147,377, filed Apr. 14, 2015; U.S. Provisional Application Ser. No. 62/148,173, filed Apr. 15, 2015, the entirety of these applications are hereby incorporated herein by reference for the teachings therein.

FIELD OF THE INVENTION

The present invention generally relates to methods and systems for detecting ploidy of a chromosome segment, and methods and systems for detecting a single nucleotide variant.

BACKGROUND OF THE INVENTION

Copy number variation (CNV) has been identified as a major cause of structural variation in the genome, involving both duplications and deletions of sequences that typically range in length from 1,000 base pairs (1 kb) to 20 megabases (Mb). Deletions and duplications of chromosome segments or entire chromosomes are associated with a variety of conditions, such as susceptibility or resistance to disease.

CNVs are often assigned to one of two main categories, based on the length of the affected sequence. The first category includes copy number polymorphisms (CNPs), which are common in the general population, occurring with an overall frequency of greater than 1%. CNPs are typically small (most are less than 10 kilobases in length), and they are often enriched for genes that encode proteins important in drug detoxification and immunity. A subset of these CNPs is highly variable with respect to copy number. As a result, different human chromosomes can have a wide range of copy numbers (e.g., 2, 3, 4, 5, etc.) for a particular set of genes. CNPs associated with immune response genes have recently been associated with susceptibility to complex genetic diseases, including psoriasis, Crohn's disease, and glomerulonephritis.

The second class of CNVs includes relatively rare variants that are much longer than CNPs, ranging in size from hundreds of thousands of base pairs to over 1 million base pairs in length. In some cases, these CNVs may have arisen during production of the sperm or egg that gave rise to a particular individual, or they may have been passed down for only a few generations within a family. These large and rare structural variants have been observed disproportionately in subjects with mental retardation, developmental delay, schizophrenia, and autism. Their appearance in such subjects has led to speculation that large and rare CNVs may be more important in neurocognitive diseases than other forms of inherited mutations, including single nucleotide substitutions.

Gene copy number can be altered in cancer cells. For instance, duplication of Chr1p is common in breast cancer, and the EGFR copy number can be higher than normal in non-small cell lung cancer. Cancer is one of the leading causes of death; thus, early diagnosis and treatment of cancer is important, since it can improve the patient's outcome (such as by increasing the probability of remission and the duration of remission). Early diagnosis can also allow the patient to undergo fewer or less drastic treatment alternatives. Many of the current treatments that destroy cancerous cells also affect normal cells, resulting in a variety of possible side-effects, such as nausea, vomiting, low blood cell counts, increased risk of infection, hair loss, and ulcers in mucous membranes. Thus, early detection of cancer is desirable since it can reduce the amount and/or number of treatments (such as chemotherapeutic agents or radiation) needed to eliminate the cancer.

Copy number variation has also been associated with severe mental and physical handicaps, and idiopathic learning disability. Non-invasive prenatal testing (NIPT) using cell-free DNA (cfDNA) can be used to detect abnormalities, such as fetal trisomies 13, 18, and 21, triploidy, and sex chromosome aneuploidies. Subchromosomal microdeletions, which can also result in severe mental and physical handicaps, are more challenging to detect due to their smaller size. Eight of the microdeletion syndromes have an aggregate incidence of more than 1 in 1000, making them nearly as common as fetal autosomal trisomies.

In addition, a higher copy number of CCL3L1 has been associated with lower susceptibility to HIV infection, and a low copy number of FCGR3B (the CD16 cell surface immunoglobulin receptor) can increase susceptibility to systemic lupus erythematosus and similar inflammatory autoimmune disorders.

Thus, improved methods are needed to detect deletions and duplications of chromosome segments or entire chromosomes. Preferably, these methods can be used to more accurately diagnose disease or an increased risk of disease, such as cancer or CNVs in a gestating fetus.

SUMMARY OF THE INVENTION

In illustrative embodiments, provided herein is a method for determining ploidy of a chromosomal segment in a sample of an individual. The method includes the following steps:

-   a. receiving allele frequency data comprising the amount of each allele present in the sample at each locus in a set of polymorphic loci on the chromosomal segment;
-   b. generating phased allelic information for the set of polymorphic loci by estimating the phase of the allele frequency data;
-   c. generating individual probabilities of allele frequencies for the polymorphic loci for different ploidy states using the allele frequency data;
-   d. generating joint probabilities for the set of polymorphic loci using the individual probabilities and the phased allelic information; and
-   e. selecting, based on the joint probabilities, a best fit model indicative of chromosomal ploidy, thereby determining ploidy of the chromosomal segment.
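By way of non-limiting illustration only, steps c through e above can be sketched in Python as follows. The sketch assumes that phasing (step b) has already been performed, that the read counts at each locus follow a binomial distribution, and that loci are independent once the phase is fixed, so that crossovers and linkage are ignored. The function names, the hypothesis labels, and the binomial model are assumptions made for this example and are not taken from the method as described.

```python
import math

def binom_logpmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # keep log() finite at p = 0 or p = 1
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def expected_a_fraction(hap1, hap2, copies1, copies2):
    """Expected fraction of 'A' reads given phased alleles and homolog copy numbers."""
    a_copies = copies1 * (hap1 == "A") + copies2 * (hap2 == "A")
    return a_copies / (copies1 + copies2)

def select_ploidy(loci, hypotheses):
    """loci: list of (a_reads, total_reads, allele_on_homolog_1, allele_on_homolog_2).
    hypotheses: dict mapping a label to (copies of homolog 1, copies of homolog 2).
    Returns the best-fit label and the joint log-probability of every hypothesis."""
    joint = {}
    for name, (c1, c2) in hypotheses.items():
        # Step c: one log-probability per locus; step d: sum them into a joint value.
        joint[name] = sum(
            binom_logpmf(a, n, expected_a_fraction(h1, h2, c1, c2))
            for a, n, h1, h2 in loci
        )
    best = max(joint, key=joint.get)  # step e: best fit model
    return best, joint
```

Because the phase is used when computing the expected fraction at each locus, a duplication of one homolog raises the 'A' fraction at some loci and lowers it at others, and the joint value rewards only the hypothesis that is consistent across all of them.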

In one illustrative embodiment of the method for determining ploidy, the data is generated using nucleic acid sequence data, especially high throughput nucleic acid sequence data. In certain illustrative examples of the method for determining ploidy, the allele frequency data is corrected for errors before it is used to generate individual probabilities. In specific illustrative embodiments, the errors that are corrected include allele amplification efficiency bias. In other embodiments, the errors that are corrected include ambient contamination and genotype contamination. In some embodiments, errors that are corrected include allele amplification bias, ambient contamination and genotype contamination.

In certain embodiments of the method for determining ploidy, the individual probabilities are generated using a set of models of both different ploidy states and allelic imbalance fractions for the set of polymorphic loci. In these embodiments, and other embodiments, the joint probabilities are generated by considering the linkage between polymorphic loci on the chromosome segment.

Accordingly, in one illustrative embodiment that combines some of these embodiments, provided herein is a method for detecting chromosomal ploidy in a sample of an individual, that includes the following steps:

-   a. receiving nucleic acid sequence data for alleles at a set of polymorphic loci on a chromosome segment in the individual;
-   b. detecting allele frequencies at the set of loci using the nucleic acid sequence data;
-   c. correcting for allele amplification efficiency bias in the detected allele frequencies to generate corrected allele frequencies for the set of polymorphic loci;
-   d. generating phased allelic information for the set of polymorphic loci by estimating the phase of the nucleic acid sequence data;
-   e. generating individual probabilities of allele frequencies for the polymorphic loci for different ploidy states by comparing the corrected allele frequencies to a set of models of different ploidy states and allelic imbalance fractions of the set of polymorphic loci;
-   f. generating joint probabilities for the set of polymorphic loci by combining the individual probabilities considering the linkage between polymorphic loci on the chromosome segment; and
-   g. selecting, based on the joint probabilities, the best fit model indicative of chromosomal aneuploidy.

In another aspect, provided herein is a system for detecting chromosomal ploidy in a sample of an individual, the system comprising:

-   a. an input processor configured to receive allelic frequency data comprising the amount of each allele present in the sample at each locus in a set of polymorphic loci on the chromosomal segment;
-   b. a modeler configured to:
    -   i. generate phased allelic information for the set of polymorphic loci by estimating the phase of the allele frequency data;
    -   ii. generate individual probabilities of allele frequencies for the polymorphic loci for different ploidy states using the allele frequency data; and
    -   iii. generate joint probabilities for the set of polymorphic loci using the individual probabilities and the phased allelic information; and
-   c. a hypothesis manager configured to select, based on the joint probabilities, a best fit model indicative of chromosomal ploidy, thereby determining ploidy of the chromosomal segment.

In certain embodiments of this system embodiment, the allele frequency data is data generated by a nucleic acid sequencing system. In certain embodiments, the system further comprises an error correction unit configured to correct for errors in the allele frequency data, wherein the corrected allele frequency data is used by the modeler to generate individual probabilities. In certain embodiments, the error correction unit corrects for allele amplification efficiency bias. In certain embodiments, the modeler generates the individual probabilities using a set of models of both different ploidy states and allelic imbalance fractions for the set of polymorphic loci. The modeler, in certain exemplary embodiments, generates the joint probabilities by considering the linkage between polymorphic loci on the chromosome segment.

In one illustrative embodiment, provided herein is a system for detecting chromosomal ploidy in a sample of an individual, that includes the following:

-   a. an input processor configured to receive nucleic acid sequence data for alleles at a set of polymorphic loci on a chromosome segment in the individual and detect allele frequencies at the set of loci using the nucleic acid sequence data;
-   b. an error correction unit configured to correct for errors in the detected allele frequencies and generate corrected allele frequencies for the set of polymorphic loci;
-   c. a modeler configured to:
    -   i. generate phased allelic information for the set of polymorphic loci by estimating the phase of the nucleic acid sequence data;
    -   ii. generate individual probabilities of allele frequencies for the polymorphic loci for different ploidy states by comparing the phased allelic information to a set of models of different ploidy states and allelic imbalance fractions of the set of polymorphic loci; and
    -   iii. generate joint probabilities for the set of polymorphic loci by combining the individual probabilities considering the relative distance between polymorphic loci on the chromosome segment; and
-   d. a hypothesis manager configured to select, based on the joint probabilities, a best fit model indicative of chromosomal aneuploidy.

In certain aspects, the present invention provides a method for determining whether circulating tumor nucleic acids are present in a sample in an individual, comprising:

-   a. analyzing the sample to determine a ploidy at a set of polymorphic loci on a chromosome segment in the individual; and
-   b. determining the level of allelic imbalance present at the polymorphic loci based on the ploidy determination, wherein an allelic imbalance equal to or greater than 0.4%, 0.45%, or 0.5% is indicative of the presence of circulating tumor nucleic acids in the sample.
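As a rough, non-limiting illustration of how a threshold of this kind might be applied, the Python sketch below computes an allelic imbalance for a mixture of tumor-derived and normal disomic DNA and compares it with a cutoff. The definition used here (the absolute difference between the proportions of the two homologs in the mixture), the treatment of `tumor_fraction` as the fraction of contributing cells that are tumor cells, and both function names are assumptions made for this example.

```python
def allelic_imbalance(tumor_fraction, copies_hap1, copies_hap2):
    """Absolute difference between the proportions of the two homologs in a
    mixture of tumor cells and normal disomic cells (assumed definition)."""
    hap1 = tumor_fraction * copies_hap1 + (1.0 - tumor_fraction)
    hap2 = tumor_fraction * copies_hap2 + (1.0 - tumor_fraction)
    # Note: undefined when the segment is entirely absent from the mixture.
    return abs(hap1 - hap2) / (hap1 + hap2)

def indicates_ctdna(imbalance, threshold=0.0045):
    """True when the imbalance meets or exceeds the cutoff (0.45% by default)."""
    return imbalance >= threshold
```

Under these assumptions, a single-copy gain of one homolog in 1% of contributing cells yields an imbalance of 0.01/2.01, or about 0.50%, which meets a 0.45% cutoff, whereas the same gain in 0.5% of cells yields about 0.25% and does not.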

In certain embodiments, the method for determining whether circulating tumor nucleic acids are present further comprises detecting a single nucleotide variant at a single nucleotide variance site in a set of single nucleotide variance locations, wherein detecting either an allelic imbalance equal to or greater than 0.45% or detecting the single nucleotide variant, or both, is indicative of the presence of circulating tumor nucleic acids in the sample.

In certain embodiments, the analyzing step in the method for determining whether circulating tumor nucleic acids are present includes analyzing a set of chromosome segments known to exhibit aneuploidy in cancer. In certain embodiments, the analyzing step in the method for determining whether circulating tumor nucleic acids are present includes analyzing between 1,000 and 50,000, or between 100 and 1000, polymorphic loci for ploidy.

In certain aspects, provided herein are methods for detecting single nucleotide variants in a sample. Accordingly, provided herein is a method for determining whether a single nucleotide variant is present at a set of genomic positions in a sample from an individual, the method comprising:

-   a. for each genomic position, generating an estimate of efficiency and a per cycle error rate for an amplicon spanning that genomic position, using a training data set;
-   b. receiving observed nucleotide identity information for each genomic position in the sample;
-   c. determining a set of probabilities of single nucleotide variant percentage resulting from one or more real mutations at each genomic position, by comparing the observed nucleotide identity information at each genomic position to a model of different variant percentages using the estimated amplification efficiency and the per cycle error rate for each genomic position independently; and
-   d. determining the most-likely real variant percentage and confidence from the set of probabilities for each genomic position.
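Steps c and d can be pictured, in a deliberately simplified and non-limiting form, with the Python sketch below. It assumes that the efficiency and per cycle error rate from step a are already available for the position, that amplification errors accumulate by roughly `error * efficiency / (1 + efficiency)` per cycle, that variant read counts are binomially distributed, and that candidate variant fractions are drawn from a small grid with a uniform prior. None of these modeling choices, nor the function names or the `cycles` parameter, are specified by the method as described; they are illustrative assumptions only.

```python
import math

def binom_logpmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def background_error(efficiency, per_cycle_error, cycles):
    """Approximate fraction of erroneous molecules after amplification.
    Each cycle, newly made copies are efficiency / (1 + efficiency) of all
    molecules, and each new copy is assumed to carry an error with
    probability per_cycle_error."""
    return cycles * per_cycle_error * efficiency / (1.0 + efficiency)

def call_variant(variant_reads, total_reads, efficiency, per_cycle_error,
                 cycles, grid):
    """Return (most likely real variant fraction, confidence that it is > 0)."""
    bg = background_error(efficiency, per_cycle_error, cycles)
    # Step c: probability of the observed reads under each candidate fraction.
    logls = [binom_logpmf(variant_reads, total_reads, theta + (1 - theta) * bg)
             for theta in grid]
    top = max(logls)
    weights = [math.exp(l - top) for l in logls]
    total = sum(weights)
    posterior = [w / total for w in weights]
    # Step d: most likely fraction, plus posterior mass on non-zero fractions.
    best_index = max(range(len(grid)), key=posterior.__getitem__)
    confidence = sum(p for theta, p in zip(grid, posterior) if theta > 0)
    return grid[best_index], confidence
```

Each genomic position is evaluated independently by calling `call_variant` with that position's own efficiency and error estimates, which mirrors the per-position treatment recited in step c.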

In illustrative embodiments of the method for determining whether a single nucleotide variant is present, the estimate of efficiency and the per cycle error rate are generated for a set of amplicons that span the genomic position. For example, 2, 3, 4, 5, 10, 15, 20, 25, 50, 100 or more amplicons can be included that span the genomic position. In certain embodiments of this method for detecting one or more SNVs, the limit of detection is 0.015%, 0.017%, or 0.02%.

In illustrative embodiments of the method for determining whether a single nucleotide variant is present, the observed nucleotide identity information comprises an observed number of total reads for each genomic position and an observed number of variant allele reads for each genomic position.

In illustrative embodiments of the method for determining whether a single nucleotide variant is present, the sample is a plasma sample and the single nucleotide variant is present in circulating tumor DNA of the sample.

In another embodiment, provided herein is a method for detecting one or more single nucleotide variants in a test sample from an individual. The method according to this embodiment includes the following steps:

-   a. determining a median variant allele frequency for a plurality of control samples from each of a plurality of normal individuals, for each single nucleotide variant position in a set of single nucleotide variance positions based on results generated in a sequencing run, to identify selected single nucleotide variant positions having variant median allele frequencies in normal samples below a threshold value and to determine background error for each of the single nucleotide variant positions after removing outlier samples for each of the single nucleotide variant positions;
-   b. determining an observed depth of read weighted mean and variance for the selected single nucleotide variant positions for the test sample based on data generated in the sequencing run for the test sample; and
-   c. identifying using a computer, one or more single nucleotide variant positions with a statistically significant depth of read weighted mean compared to the background error for that position, thereby detecting the one or more single nucleotide variants.
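One possible, non-limiting reading of steps a through c for a single variant position is sketched below in Python. The sketch assumes that background error is summarized by the mean and standard deviation of control variant allele frequencies, that outliers are control samples more than a chosen number of standard deviations from the mean, that the test sample contributes several (frequency, depth) measurements for the position, and that significance is judged by a simple z-score. The default threshold values and all function names are hypothetical and chosen only for the example.

```python
import statistics

def background_model(control_vafs, median_threshold=0.001, outlier_z=3.0):
    """Step a: return (mean, stdev) of background error for one position,
    or None if the median control frequency is not below the threshold.
    Needs at least two control values after outlier removal."""
    if statistics.median(control_vafs) >= median_threshold:
        return None  # position not selected
    mu = statistics.mean(control_vafs)
    sd = statistics.pstdev(control_vafs)
    # Remove outlier control samples before estimating background error.
    kept = [v for v in control_vafs if sd == 0 or abs(v - mu) <= outlier_z * sd]
    return statistics.mean(kept), statistics.stdev(kept)

def depth_weighted_mean(vafs, depths):
    """Step b: variant allele frequency averaged with depth of read as weight."""
    return sum(v * d for v, d in zip(vafs, depths)) / sum(depths)

def is_significant(test_vafs, test_depths, background, z_cutoff=5.0):
    """Step c: compare the test sample's weighted mean with background error."""
    mu, sd = background
    observed = depth_weighted_mean(test_vafs, test_depths)
    return sd > 0 and (observed - mu) / sd > z_cutoff
```

Weighting by depth of read lets measurements backed by more reads count for more, which is one plausible motivation for the weighted mean recited in step b.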

In certain embodiments of this method for detecting one or more SNVs, the sample is a plasma sample, the control samples are plasma samples, and the detected one or more single nucleotide variants are present in circulating tumor DNA of the sample. In certain embodiments of this method for detecting one or more SNVs, the plurality of control samples comprises at least 25 samples. In certain embodiments of this method for detecting one or more SNVs, outliers are removed from the data generated in the high throughput sequencing run before the observed depth of read weighted mean and observed variance are determined. In certain embodiments of this method for detecting one or more SNVs, the depth of read for each single nucleotide variant position for the test sample is at least 100 reads.

In certain embodiments of this method for detecting one or more SNVs, the sequencing run comprises a multiplex amplification reaction performed under limited primer reaction conditions. In certain embodiments of this method for detecting one or more SNVs, the limit of detection is 0.015%, 0.017%, or 0.02%.

In one aspect, the invention features a method of determining if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment in the genome of one or more cells from an individual. In some embodiments, the method includes obtaining phased genetic data for the first homologous chromosome segment comprising the identity of the allele present at that locus on the first homologous chromosome segment for each locus in a set of polymorphic loci on the first homologous chromosome segment, obtaining phased genetic data for the second homologous chromosome segment comprising the identity of the allele present at that locus on the second homologous chromosome segment for each locus in the set of polymorphic loci on the second homologous chromosome segment, and obtaining measured genetic allelic data comprising the amount of each allele present in a sample of DNA or RNA from one or more cells from the individual, for each of the alleles at each of the loci in the set of polymorphic loci. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment in the genome of one or more cells from the individual, calculating (such as calculating on a computer) a likelihood of one or more of the hypotheses based on the obtained genetic data of the sample and the obtained phased genetic data, and selecting the hypothesis with the greatest likelihood, thereby determining the degree of overrepresentation of the number of copies of the first homologous chromosome segment in the genome of one or more cells from the individual. In some embodiments, the phased data includes inferred phased data using population based haplotype frequencies and/or measured phased data (e.g., phased data obtained by measuring a sample containing DNA or RNA from the individual or a relative of the individual).

In one aspect, the invention provides a method for determining if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment in the genome of one or more cells from an individual. In some embodiments, the method includes obtaining phased genetic data for the first homologous chromosome segment comprising the identity of the allele present at that locus on the first homologous chromosome segment for each locus in a set of polymorphic loci on the first homologous chromosome segment, obtaining phased genetic data for the second homologous chromosome segment comprising the identity of the allele present at that locus on the second homologous chromosome segment for each locus in the set of polymorphic loci on the second homologous chromosome segment, and obtaining measured genetic allelic data comprising the amount of each allele present in a sample of DNA or RNA from one or more cells from the individual for each of the alleles at each of the loci in the set of polymorphic loci. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment; calculating, for each of the hypotheses, expected genetic data for the plurality of loci in the sample from the obtained phased genetic data; calculating (such as calculating on a computer) the data fit between the obtained genetic data of the sample and the expected genetic data for the sample; ranking one or more of the hypotheses according to the data fit; and selecting the hypothesis that is ranked the highest, thereby determining the degree of overrepresentation of the number of copies of the first homologous chromosome segment in the genome of one or more cells from the individual.

In one aspect, the invention features a method for determining if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment in the genome of one or more cells from an individual. In some embodiments, the method includes obtaining phased genetic data for the first homologous chromosome segment comprising the identity of the allele present at that locus on the first homologous chromosome segment for each locus in a set of polymorphic loci on the first homologous chromosome segment, obtaining phased genetic data for the second homologous chromosome segment comprising the identity of the allele present at that locus on the second homologous chromosome segment for each locus in the set of polymorphic loci on the second homologous chromosome segment, and obtaining measured genetic allelic data comprising, for each of the alleles at each of the loci in the set of polymorphic loci, the amount of each allele present in a sample of DNA or RNA from one or more target cells and one or more non-target cells from the individual. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment; calculating (such as calculating on a computer), for each of the hypotheses, expected genetic data for the plurality of loci in the sample from the obtained phased genetic data for one or more possible ratios of DNA or RNA from the one or more target cells to the total DNA or RNA in the sample; calculating (such as calculating on a computer) for each possible ratio of DNA or RNA and for each hypothesis, the data fit between the obtained genetic data of the sample and the expected genetic data for the sample for that possible ratio of DNA or RNA and for that hypothesis; ranking one or more of the hypotheses according to the data fit; and selecting the hypothesis that is ranked the highest, thereby determining the degree of overrepresentation of the number of copies of the first homologous chromosome segment in the genome of one or more cells from the individual.

In one aspect, the invention features a method for determining if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment in the genome of one or more cells from an individual. In some embodiments, the method includes obtaining phased genetic data for the first homologous chromosome segment comprising the identity of the allele present at that locus on the first homologous chromosome segment for each locus in a set of polymorphic loci on the first homologous chromosome segment, obtaining phased genetic data for the second homologous chromosome segment comprising the identity of the allele present at that locus on the second homologous chromosome segment for each locus in the set of polymorphic loci on the second homologous chromosome segment, and obtaining measured genetic allelic data comprising the amount of each allele present in a sample of DNA or RNA from one or more target cells and one or more non-target cells from the individual for each of the alleles at each of the loci in the set of polymorphic loci. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment; calculating (such as calculating on a computer), for each of the hypotheses, expected genetic data for the plurality of loci in the sample from the obtained phased genetic data for one or more possible ratios of DNA or RNA from the one or more target cells to the total DNA or RNA in the sample; calculating (such as calculating on a computer) for each locus in the plurality of loci, each possible ratio of DNA or RNA, and each hypothesis, the likelihood that the hypothesis is correct by comparing the obtained genetic data of the sample for that locus and the expected genetic data for that locus for that possible ratio of DNA or RNA and for that hypothesis; determining the combined probability for each hypothesis by combining the probabilities of that hypothesis for each locus and each possible ratio; and selecting the hypothesis with the greatest combined probability, thereby determining the degree of overrepresentation of the number of copies of the first homologous chromosome segment. In some embodiments, all of the loci are considered at once to calculate the probability of a particular hypothesis, and the hypothesis with the greatest probability is selected.
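The combination over loci and over possible ratios described above can be illustrated, without limitation, by the following Python sketch. It assumes binomially distributed read counts, treats each candidate ratio as the fraction of contributing cells that are target cells, assumes non-target cells carry one copy of each homolog, and averages likelihoods over the candidate ratios with equal weight. These assumptions, together with the function names and hypothesis labels, are introduced for the example and are not taken from the description.

```python
import math

def binom_logpmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # keep log() finite
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def expected_a_fraction(hap1, hap2, copies1, copies2, target_fraction):
    """Expected 'A' fraction when target cells carry (copies1, copies2) of the
    two homologs and non-target cells carry one copy of each."""
    amount1 = target_fraction * copies1 + (1.0 - target_fraction)
    amount2 = target_fraction * copies2 + (1.0 - target_fraction)
    a_amount = amount1 * (hap1 == "A") + amount2 * (hap2 == "A")
    return a_amount / (amount1 + amount2)

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def most_likely_hypothesis(loci, hypotheses, target_fractions):
    """loci: list of (a_reads, total_reads, allele_on_homolog_1, allele_on_homolog_2).
    hypotheses: dict mapping a label to (copies1, copies2) in target cells."""
    combined = {}
    for name, (c1, c2) in hypotheses.items():
        # One joint log-likelihood per candidate ratio, summed over loci.
        per_ratio = [
            sum(binom_logpmf(a, n, expected_a_fraction(h1, h2, c1, c2, f))
                for a, n, h1, h2 in loci)
            for f in target_fractions
        ]
        # Combine across ratios, weighting each candidate ratio equally.
        combined[name] = log_mean_exp(per_ratio)
    return max(combined, key=combined.get), combined
```

Working in log space avoids numerical underflow when thousands of loci are multiplied together, which is why the sketch sums log-probabilities rather than multiplying probabilities directly.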

In one aspect, the invention features a method for determining a number of copies of a chromosome segment of interest in the genome of a fetus. In some embodiments, the method includes obtaining phased genetic data for at least one biological parent of the fetus, wherein the phased genetic data comprises the identity of the allele present for each locus in a set of polymorphic loci on a first homologous chromosome segment and a second homologous chromosome segment in a pair of homologous chromosome segments that comprises the chromosome segment of interest. In some embodiments, the method includes obtaining genetic data at the set of polymorphic loci on the chromosome segment of interest in a mixed sample of DNA or RNA comprising fetal DNA or RNA and maternal DNA or RNA from the mother of the fetus by measuring the quantity of each allele at each locus. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the number of copies of the chromosome segment of interest present in the genome of the fetus. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying, for one or both parents, the number of copies of the first homologous chromosome segment or portion thereof from the parent in the genome of the fetus, the number of copies of the second homologous chromosome segment or portion thereof from the parent in the genome of the fetus, and the total number of copies of the chromosome segment of interest present in the genome of the fetus. In some embodiments, the method includes calculating (such as calculating on a computer), for each of the hypotheses, expected genetic data for the plurality of loci in the mixed sample from the obtained phased genetic data from the parent(s); calculating (such as calculating on a computer) the data fit between the obtained genetic data of the mixed sample and the expected genetic data for the mixed sample; ranking one or more of the hypotheses according to the data fit; and selecting the hypothesis that is ranked the highest, thereby determining the number of copies of the chromosome segment of interest in the genome of the fetus.

In one aspect, the invention features a method for determining a number of copies of a chromosome or chromosome segment of interest in the genome of a fetus. In some embodiments, the method includes obtaining phased genetic data for at least one biological parent of the fetus, wherein the phased genetic data comprises the identity of the allele present for each locus in a set of polymorphic loci on a first homologous chromosome segment and a second homologous chromosome segment in the parent. In some embodiments, the method includes obtaining genetic data at the set of polymorphic loci on the chromosome or chromosome segment in a mixed sample of DNA or RNA comprising fetal DNA or RNA and maternal DNA or RNA from the mother of the fetus by measuring the quantity of each allele at each locus. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the number of copies of the chromosome or chromosome segment of interest present in the genome of the fetus. In some embodiments, the method includes creating (such as creating on a computer), for each of the hypotheses, a probability distribution of the expected quantity of each allele at each of the plurality of loci in the mixed sample from (i) the obtained phased genetic data from the parent(s) and (ii) optionally the probability of one or more crossovers that may have occurred during the formation of a gamete that contributed a copy of the chromosome or chromosome segment of interest to the fetus; calculating (such as calculating on a computer) a fit, for each of the hypotheses, between (1) the obtained genetic data of the mixed sample and (2) the probability distribution of the expected quantity of each allele at each of the plurality of loci in the mixed sample for that hypothesis; ranking one or more of the hypotheses according to the data fit; and selecting the hypothesis that is ranked the highest, thereby determining the number of copies of the chromosome segment of interest in the genome of the fetus.

In some embodiments, the method includes obtaining phased genetic data for the mother of the fetus. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the number of copies of the first homologous chromosome segment or portion thereof from the mother in the genome of the fetus, the number of copies of the second homologous chromosome segment or portion thereof from the mother in the genome of the fetus, and the total number of copies of the chromosome segment of interest present in the genome of the fetus. In some embodiments, the method includes calculating, for each of the hypotheses, expected genetic data for the plurality of loci in the mixed sample from the obtained phased genetic data from the mother.

In some embodiments, the expected genetic data for each of the hypotheses comprises the identity and an amount of one or more alleles at each locus in the plurality of loci from the maternal DNA or RNA and fetal DNA or RNA in the mixed sample. In some embodiments, the method includes calculating (such as calculating on a computer) expected genetic data by determining a fraction of fetal DNA or RNA and a fraction of maternal DNA or RNA in the mixed sample. In some embodiments, the method includes calculating, for each locus in the plurality of loci, the expected amount of one or more of the alleles for that locus in the maternal DNA or RNA in the mixed sample using the identity of the allele(s) present at that locus in the obtained phased genetic data of the mother and the fraction of maternal DNA or RNA in the mixed sample. In some embodiments, the method includes calculating (such as calculating on a computer), for each locus in the plurality of loci for each hypothesis, the expected amount of one or more of the alleles for that locus in the fetal DNA or RNA inherited from the mother in the mixed sample using the identity of the allele present at that locus in the first or second homologous chromosome segment from the mother that is specified by the hypothesis to have been inherited by the fetus, the number of copies of the first or second homologous chromosome segment from the mother that is specified by the hypothesis to have been inherited by the fetus, and the fraction of fetal DNA or RNA in the mixed sample.

In some embodiments, the expected genetic data for each of the hypotheses comprises the identity and an amount of one or more alleles at each locus in the plurality of loci from the maternal DNA or RNA and fetal DNA or RNA in the mixed sample. In some embodiments, the method includes calculating expected genetic data by determining a fraction of fetal DNA or RNA and a fraction of maternal DNA or RNA in the mixed sample. In some embodiments, the method includes calculating (such as calculating on a computer), for each locus in the plurality of loci, the expected amount of one or more of the alleles for that locus in the maternal DNA or RNA in the mixed sample using the identity of the allele(s) present at that locus in the obtained phased genetic data of the mother and the fraction of maternal DNA or RNA in the mixed sample. In some embodiments, the method includes calculating (such as calculating on a computer), for each locus in the plurality of loci for each hypothesis, the expected amount of one or more of the alleles for that locus in the fetal DNA or RNA inherited from the mother in the mixed sample using the identity of the allele present at that locus in the first or second homologous chromosome segment from the mother that is specified by the hypothesis to have been inherited by the fetus, the number of copies of the first or second homologous chromosome segment from the mother that is specified by the hypothesis to have been inherited by the fetus, the identity of one or more possible alleles at that locus in the first or second homologous chromosome segment from the father that is specified by the hypothesis to have been inherited by the fetus, the number of copies of the first or second homologous chromosome segment from the father that is specified by the hypothesis to have been inherited by the fetus, and the fraction of fetal DNA or RNA in the mixed sample. In some embodiments, population frequencies are used to predict the identity of the alleles in the first or second homologous chromosome segment from the father. In some embodiments, the probabilities for each of the possible alleles at each locus in the first or second homologous chromosome segment from the father are considered to be the same.

In some embodiments, the method includes obtaining phased genetic data for both the mother and father of the fetus. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the number of copies of the first homologous chromosome segment or portion thereof from the mother in the genome of the fetus, the number of copies of the second homologous chromosome segment or portion thereof from the mother in the genome of the fetus, the number of copies of the first homologous chromosome segment or portion thereof from the father in the genome of the fetus, the number of copies of the second homologous chromosome segment or portion thereof from the father in the genome of the fetus, and the total number of copies of the chromosome segment of interest present in the genome of the fetus. In some embodiments, the method includes calculating (such as calculating on a computer), for each of the hypotheses, expected genetic data for the plurality of loci in the mixed sample from the obtained phased genetic data from the mother and obtained phased genetic data from the father.

In some embodiments, the expected genetic data for each of the hypotheses comprises the identity and an amount of one or more alleles at each locus in the plurality of loci from the maternal DNA or RNA and fetal DNA or RNA in the mixed sample. In some embodiments, the method includes calculating expected genetic data by determining a fraction of fetal DNA or RNA and a fraction of maternal DNA or RNA in the mixed sample. In some embodiments, the method includes calculating (such as calculating on a computer), for each locus in the plurality of loci, the expected amount of one or more of the alleles for that locus in the maternal DNA or RNA in the mixed sample using the identity of the allele(s) present at that locus in the obtained phased genetic data of the mother and the fraction of maternal DNA or RNA in the mixed sample. In some embodiments, the method includes calculating (such as calculating on a computer), for each locus in the plurality of loci for each hypothesis, the expected amount of one or more of the alleles for that locus in the fetal DNA or RNA in the mixed sample using the identity of the allele present at that locus in the first or second homologous chromosome segment from the mother that is specified by the hypothesis to have been inherited by the fetus, the number of copies of the first or second homologous chromosome segment from the mother that is specified by the hypothesis to have been inherited by the fetus, the identity of the allele present at that locus in the first or second homologous chromosome segment from the father that is specified by the hypothesis to have been inherited by the fetus, the number of copies of the first or second homologous chromosome segment from the father that is specified by the hypothesis to have been inherited by the fetus, and the fraction of fetal DNA or RNA in the mixed sample.
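The per-locus calculation described above can be sketched as follows. This is only an illustrative model under stated assumptions: the mother is taken to be disomic at the locus, the fetal fraction is treated as the proportion of genome equivalents contributed by the fetus, and the function name, the argument names, and the keys "m1", "m2", "f1", and "f2" (first and second homologs of the mother and father) are hypothetical.

```python
def expected_allele_fraction(allele, mother_haps, father_haps, copies, fetal_fraction):
    """Expected fraction of `allele` at one locus in the mixed sample.

    mother_haps, father_haps: (allele on first homolog, allele on second homolog).
    copies: number of copies of each parental homolog that the hypothesis
        specifies the fetus inherited, keyed by "m1", "m2", "f1", "f2".
    fetal_fraction: fraction of fetal DNA or RNA in the mixed sample.
    """
    f = fetal_fraction
    # Maternal contribution: two homologs, weighted by the maternal fraction.
    maternal_allele = (1 - f) * sum(1 for a in mother_haps if a == allele)
    maternal_total = (1 - f) * 2
    # Fetal contribution: homologs inherited under the hypothesis, weighted by f.
    homolog_allele = {
        "m1": mother_haps[0], "m2": mother_haps[1],
        "f1": father_haps[0], "f2": father_haps[1],
    }
    fetal_allele = f * sum(n for key, n in copies.items() if homolog_allele[key] == allele)
    fetal_total = f * sum(copies.values())
    return (maternal_allele + fetal_allele) / (maternal_total + fetal_total)

# Hypothetical locus: mother is A|B, father is B|B, fetal fraction 10%.
disomy = expected_allele_fraction(
    "A", ("A", "B"), ("B", "B"), {"m1": 1, "m2": 0, "f1": 1, "f2": 0}, 0.1)
extra_maternal_copy = expected_allele_fraction(
    "A", ("A", "B"), ("B", "B"), {"m1": 2, "m2": 0, "f1": 1, "f2": 0}, 0.1)
```

Under these assumptions, an extra copy of the first maternal homolog shifts the expected fraction of its allele upward, and the size of the shift grows with the fetal fraction.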

In some embodiments, the method includes calculating (such as calculating on a computer), for each of the hypotheses, a probability distribution of expected genetic data for the plurality of loci in the mixed sample from the obtained phased genetic data from the parent(s). In some embodiments, the method includes increasing the probability in the probability distribution of a particular allele being present at a first locus in the mixed sample if that particular allele is present in the first homologous segment in the parent and an allele at a nearby locus in the first homologous segment in the parent is observed in the obtained genetic data of the mixed sample; or decreasing the probability in the probability distribution of a particular allele being present at a first locus in the mixed sample if that particular allele is present in the first homologous segment in the parent and an allele at a nearby locus in the first homologous segment in the parent is not observed in the obtained genetic data of the mixed sample. In some embodiments, the method includes increasing the probability in the probability distribution of a particular allele being present at a second locus in the mixed sample if that particular allele is present in the second homologous segment in the parent and an allele at a nearby locus in the second homologous segment in the parent is observed in the obtained genetic data of the mixed sample; or decreasing the probability in the probability distribution of a particular allele being present at a second locus in the mixed sample if that particular allele is present in the second homologous segment in the parent and an allele at a nearby locus in the second homologous segment in the parent is not observed in the obtained genetic data of the mixed sample.

In some embodiments, the method includes obtaining phased genetic data for both the mother and father of the fetus. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the number of copies of the first homologous chromosome segment or portion thereof from the mother in the genome of the fetus, the number of copies of the second homologous chromosome segment or portion thereof from the mother in the genome of the fetus, the number of copies of the first homologous chromosome segment or portion thereof from the father in the genome of the fetus, the number of copies of the second homologous chromosome segment or portion thereof from the father in the genome of the fetus, and the total number of copies of the chromosome segment of interest present in the genome of the fetus. In some embodiments, the method includes calculating (such as calculating on a computer), for each of the hypotheses, a probability distribution of expected genetic data for the plurality of loci in the mixed sample from the obtained phased genetic data from the mother and father. In some embodiments, the method includes increasing the probability in the probability distribution of a particular allele being present at a first locus in the mixed sample if that particular allele is present in the first homologous segment in the mother or father and an allele at a nearby locus in the first homologous segment in that parent is observed in the obtained genetic data of the mixed sample; or decreasing the probability in the probability distribution of a particular allele being present at a first locus in the mixed sample if that particular allele is present in the first homologous segment in the mother or father and an allele at a nearby locus in the first homologous segment in that parent is not observed in the obtained genetic data of the mixed sample. In some embodiments, the method includes increasing the probability in the probability distribution of a particular allele being present at a second locus in the mixed sample if that particular allele is present in the second homologous segment in the mother or father and an allele at a nearby locus in the second homologous segment in that parent is observed in the obtained genetic data of the mixed sample; or decreasing the probability in the probability distribution of a particular allele being present at a second locus in the mixed sample if that particular allele is present in the second homologous segment in the mother or father and an allele at a nearby locus in the second homologous segment in that parent is not observed in the obtained genetic data of the mixed sample.

In some embodiments, the first locus and the locus that is nearby to the first locus co-segregate. In some embodiments, the second locus and the locus that is nearby to the second locus co-segregate. In some embodiments, no crossovers are expected to occur between the first locus and the locus that is nearby to the first locus. In some embodiments, no crossovers are expected to occur between the second locus and the locus that is nearby to the second locus. In some embodiments, the distance between the first locus and the locus that is nearby to the first locus is less than 5 mb, 1 mb, 100 kb, 10 kb, 1 kb, 0.1 kb, or 0.01 kb. In some embodiments, the distance between the second locus and the locus that is nearby to the second locus is less than 5 mb, 1 mb, 100 kb, 10 kb, 1 kb, 0.1 kb, or 0.01 kb.

In some embodiments, one or more crossovers occur during the formation of a gamete that contributed a copy of the chromosome segment of interest to the fetus; and the crossover produces a chromosome segment of interest in the genome of the fetus that comprises a portion of the first homologous segment and a portion of the second homologous segment from the parent. In some embodiments, the set of hypotheses comprises one or more hypotheses specifying the number of copies of the chromosome segment of interest in the genome of the fetus that comprises a portion of the first homologous segment and a portion of the second homologous segment from the parent.

In some embodiments, the expected genetic data of the mixed sample comprises the expected amount of one or more of the alleles at each locus in the plurality of loci in the mixed sample for each of the hypotheses.

In one aspect, the invention features a method of determining if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment in the genome of an individual (such as in the genome of one or more cells, cfDNA, cfRNA, an individual suspected of having cancer, a fetus, or an embryo) using phased genetic data. In some embodiments, the method involves simultaneously or sequentially in any order (i) obtaining phased genetic data for the first homologous chromosome segment comprising the identity of the allele present at that locus on the first homologous chromosome segment for each locus in a set of polymorphic loci on the first homologous chromosome segment, (ii) obtaining phased genetic data for the second homologous chromosome segment comprising the identity of the allele present at that locus on the second homologous chromosome segment for each locus in the set of polymorphic loci on the second homologous chromosome segment, and (iii) obtaining measured genetic allelic data comprising the amount of each allele at each of the loci in the set of polymorphic loci in a sample of DNA or RNA from one or more cells from the individual or in a mixed sample of cell-free DNA or RNA from two or more genetically different cells from the individual. In some embodiments, the method involves calculating allele ratios for one or more loci in the set of polymorphic loci that are heterozygous in at least one cell from which the sample was derived. In some embodiments, the calculated allele ratio for a particular locus is the measured quantity of one of the alleles divided by the total measured quantity of all the alleles for the locus. In some embodiments, the method involves determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment by comparing one or more calculated allele ratios for a locus to an expected allele ratio, such as a ratio that is expected for that locus if the first and second homologous chromosome segments are present in equal proportions. In some embodiments, the expected ratio is 0.5 for biallelic loci.

In some embodiments for prenatal testing, the method involves simultaneously or sequentially in any order (i) obtaining phased genetic data for the first homologous chromosome segment in the genome of a fetus (such as a fetus gestating in a pregnant mother) comprising the identity of the allele present at that locus on the first homologous chromosome segment for each locus in a set of polymorphic loci on the first homologous chromosome segment, (ii) obtaining phased genetic data for the second homologous chromosome segment in the genome of the fetus comprising the identity of the allele present at that locus on the second homologous chromosome segment for each locus in the set of polymorphic loci on the second homologous chromosome segment, and (iii) obtaining measured genetic allelic data comprising the amount of each allele at each of the loci in the set of polymorphic loci in a mixed sample of DNA or RNA from the mother of the fetus that includes fetal DNA or RNA and maternal DNA or RNA (such as a mixed sample of cell-free DNA or RNA originating from a blood sample from the mother that includes fetal cell-free DNA or RNA and maternal cell-free DNA or RNA). In some embodiments, the method involves calculating allele ratios for one or more loci in the set of polymorphic loci that are heterozygous in the fetus and/or heterozygous in the mother. In some embodiments, the calculated allele ratio for a particular locus is the measured quantity of one of the alleles divided by the total measured quantity of all the alleles for the locus. In some embodiments, the method involves determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment by comparing one or more calculated allele ratios for a locus to an expected allele ratio, such as a ratio that is expected for that locus if the first and second homologous chromosome segments are present in equal proportions.

In some embodiments, a calculated allele ratio is indicative of an overrepresentation of the number of copies of the first homologous chromosome segment if either (i) the allele ratio for the measured quantity of the allele present at that locus on the first homologous chromosome divided by the total measured quantity of all the alleles for the locus is greater than the expected allele ratio for that locus, or (ii) the allele ratio for the measured quantity of the allele present at that locus on the second homologous chromosome divided by the total measured quantity of all the alleles for the locus is less than the expected allele ratio for that locus. In some embodiments, a calculated allele ratio is indicative of no overrepresentation of the number of copies of the first homologous chromosome segment if either (i) the allele ratio for the measured quantity of the allele present at that locus on the first homologous chromosome divided by the total measured quantity of all the alleles for the locus is less than or equal to the expected allele ratio for that locus, or (ii) the allele ratio for the measured quantity of the allele present at that locus on the second homologous chromosome divided by the total measured quantity of all the alleles for the locus is greater than or equal to the expected allele ratio for that locus.
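The per-locus comparison described above can be sketched for a biallelic locus. The function names are hypothetical, and the default expected ratio of 0.5 assumes that the two homologous segments would otherwise be present in equal proportions. No allowance is made here for measurement noise, which in practice would call for a statistical test rather than a strict inequality.

```python
def allele_ratio(allele_quantity, total_quantity):
    """Measured quantity of one allele divided by the total for the locus."""
    return allele_quantity / total_quantity

def indicates_overrepresentation(first_quantity, second_quantity, expected_first=0.5):
    """True if the locus points to overrepresentation of the first homolog.

    first_quantity: measured quantity of the allele on the first homolog.
    second_quantity: measured quantity of the allele on the second homolog.
    expected_first: expected ratio for the first-homolog allele; the expected
        ratio for the second-homolog allele is taken as 1 - expected_first
        (an assumption that holds for biallelic loci).
    """
    total = first_quantity + second_quantity
    first_ratio = allele_ratio(first_quantity, total)
    second_ratio = allele_ratio(second_quantity, total)
    return first_ratio > expected_first or second_ratio < (1 - expected_first)

# Hypothetical measurements at three heterozygous loci.
calls = [
    indicates_overrepresentation(600, 400),  # first homolog allele in excess
    indicates_overrepresentation(500, 500),  # equal proportions
    indicates_overrepresentation(400, 600),  # second homolog allele in excess
]
```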

In some embodiments, determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment includes enumerating a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment. In some embodiments, predicted allele ratios for the loci that are heterozygous in at least one cell (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother) are estimated for each hypothesis given the degree of overrepresentation specified by that hypothesis. In some embodiments, the likelihood that the hypothesis is correct is calculated by comparing the calculated allele ratios to the predicted allele ratios, and the hypothesis with the greatest likelihood is selected. In some embodiments, an expected distribution of a test statistic is calculated using the predicted allele ratios for each hypothesis. In some embodiments, the likelihood that the hypothesis is correct is calculated by comparing a test statistic that is calculated using the calculated allele ratios to the expected distribution of the test statistic that is calculated using the predicted allele ratios, and the hypothesis with the greatest likelihood is selected. In some embodiments, predicted allele ratios for the loci that are heterozygous in at least one cell (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother) are estimated given the phased genetic data for the first homologous chromosome segment, the phased genetic data for the second homologous chromosome segment, and the degree of overrepresentation specified by that hypothesis. In some embodiments, the likelihood that the hypothesis is correct is calculated by comparing the calculated allele ratios to the predicted allele ratios; and the hypothesis with the greatest likelihood is selected.

In some embodiments, the ratio of DNA (or RNA) from one or more target cells to the total DNA (or RNA) in the sample is calculated. An exemplary ratio is the ratio of fetal DNA (or RNA) to the total DNA (or RNA) in the sample. In some embodiments, the ratio of fetal DNA to total DNA in the sample is determined by measuring the amount of an allele at one or more loci in which the fetus has the allele and the mother does not have the allele. In some embodiments, the ratio of fetal DNA to total DNA in the sample is determined by measuring the difference in methylation between one or more maternal and fetal alleles. In some embodiments, a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment are enumerated. In some embodiments, predicted allele ratios for the loci that are heterozygous in at least one cell (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother) are estimated for each hypothesis given the calculated ratio of DNA or RNA and the degree of overrepresentation specified by that hypothesis. In some embodiments, the likelihood that the hypothesis is correct is calculated by comparing the calculated allele ratios to the predicted allele ratios, and the hypothesis with the greatest likelihood is selected. In some embodiments, an expected distribution of a test statistic calculated using the predicted allele ratios and the calculated ratio of DNA or RNA is estimated for each hypothesis. In some embodiments, the likelihood that the hypothesis is correct is determined by comparing a test statistic calculated using the calculated allele ratios and the calculated ratio of DNA or RNA to the expected distribution of the test statistic calculated using the predicted allele ratios and the calculated ratio of DNA or RNA, and the hypothesis with the greatest likelihood is selected.
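One way to estimate the fetal fraction from alleles that the fetus has and the mother lacks is sketched below. It assumes, for illustration only, that the loci used are disomic in both mother and fetus and that the fetus is heterozygous at each of them, so that the fetus-specific allele accounts for half of the fetal contribution; the function name and the example counts are hypothetical.

```python
def estimate_fetal_fraction(loci):
    """Estimate the ratio of fetal DNA to total DNA in a mixed sample.

    loci: list of (fetal_specific_quantity, total_quantity) at loci where
        the mother is homozygous and the fetus carries one copy of an
        allele that the mother does not have.
    Under disomy, the fetus-specific allele makes up half of the fetal
    contribution, so its pooled fraction is doubled.
    """
    fetal_specific = sum(quantity for quantity, _ in loci)
    total = sum(total_quantity for _, total_quantity in loci)
    return 2.0 * fetal_specific / total

# Hypothetical measurements at two informative loci.
fraction = estimate_fetal_fraction([(50, 1000), (40, 1000)])
```

Pooling the quantities across loci before dividing gives loci with more measurements more weight than a simple average of per-locus estimates would.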

In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment. In some embodiments, the method includes estimating, for each hypothesis, either (i) predicted allele ratios for the loci that are heterozygous in at least one cell (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother) given the degree of overrepresentation specified by that hypothesis or (ii) for one or more possible ratios of DNA or RNA (such as ratios of fetal DNA or RNA to the total DNA or RNA in the sample), an expected distribution of a test statistic calculated using the predicted allele ratios and the possible ratio of DNA or RNA from the one or more target cells (such as fetal cells) to the total DNA or RNA in the sample. In some embodiments, a data fit is calculated by comparing either (i) the calculated allele ratios to the predicted allele ratios, or (ii) a test statistic calculated using the calculated allele ratios and the possible ratio of DNA or RNA to the expected distribution of the test statistic calculated using the predicted allele ratios and the possible ratio of DNA or RNA. In some embodiments, one or more of the hypotheses are ranked according to the data fit, and the hypothesis that is ranked the highest is selected. In some embodiments, a technique or algorithm, such as a search algorithm, is used for one or more of the following steps: calculating the data fit, ranking the hypotheses, or selecting the hypothesis that is ranked the highest. In some embodiments, the data fit is a fit to a beta-binomial distribution or a fit to a binomial distribution. In some embodiments, the technique or algorithm is selected from the group consisting of maximum likelihood estimation, maximum a-posteriori estimation, Bayesian estimation, dynamic estimation (such as dynamic Bayesian estimation), and expectation-maximization estimation. In some embodiments, the method includes applying the technique or algorithm to the obtained genetic data and the expected genetic data.
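A fit to a beta-binomial distribution can be sketched as a log-likelihood over per-locus allele counts. The parameterization by a mean ratio and a concentration value is an assumption made for illustration, as are the function names; the specification does not state how the distribution is parameterized.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinom_logpmf(k, n, alpha, beta):
    """Log-probability of k allele counts out of n under a beta-binomial."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)

def betabinom_loglik(observed, mean_ratio, concentration):
    """Data fit of observed (allele_count, total_count) pairs to a predicted ratio.

    mean_ratio: predicted allele ratio under a hypothesis.
    concentration: hypothetical overdispersion parameter; larger values
        approach the plain binomial fit.
    """
    alpha = mean_ratio * concentration
    beta = (1 - mean_ratio) * concentration
    return sum(betabinom_logpmf(k, n, alpha, beta) for k, n in observed)

# Hypothetical comparison of two predicted ratios against the same data.
observed = [(50, 100), (48, 100)]
fit_equal = betabinom_loglik(observed, 0.5, 200.0)
fit_shifted = betabinom_loglik(observed, 0.6, 200.0)
```

Selecting the predicted ratio with the larger log-likelihood corresponds to maximum likelihood estimation over the enumerated hypotheses.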

In some embodiments, the method includes creating a partition of possible ratios (such as ratios of fetal DNA or RNA to the total DNA or RNA in the sample) that range from a lower limit to an upper limit for the ratio of DNA or RNA from the one or more target cells to the total DNA or RNA in the sample. In some embodiments, a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment are enumerated. In some embodiments, the method includes estimating, for each of the possible ratios of DNA or RNA in the partition and for each hypothesis, either (i) predicted allele ratios for the loci that are heterozygous in at least one cell (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother) given the possible ratio of DNA or RNA and the degree of overrepresentation specified by that hypothesis or (ii) an expected distribution of a test statistic calculated using the predicted allele ratios and the possible ratio of DNA or RNA. In some embodiments, the method includes calculating, for each of the possible ratios of DNA or RNA in the partition and for each hypothesis, the likelihood that the hypothesis is correct by comparing either (i) the calculated allele ratios to the predicted allele ratios, or (ii) a test statistic calculated using the calculated allele ratios and the possible ratio of DNA or RNA to the expected distribution of the test statistic calculated using the predicted allele ratios and the possible ratio of DNA or RNA. In some embodiments, the combined probability for each hypothesis is determined by combining the probabilities of that hypothesis for each of the possible ratios in the partition; and the hypothesis with the greatest combined probability is selected. In some embodiments, the combined probability for each hypothesis is determined by weighting the probability of a hypothesis for a particular possible ratio based on the likelihood that the possible ratio is the correct ratio.
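Combining probabilities over a partition of possible ratios can be sketched as a weighted sum of likelihoods. The model below is a deliberately simple assumption, not the method of the specification: each hypothesis is a number of extra copies of the first homologous segment in the target cells, non-target cells carry one copy of each homolog, allele counts follow a binomial distribution, and all names are hypothetical.

```python
import math

def binom_loglik(k, n, p):
    """Binomial log-probability of k allele counts out of n."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)

def predicted_ratio(extra_copies, target_ratio):
    """Predicted first-homolog allele ratio at a heterozygous locus."""
    return (1 + target_ratio * extra_copies) / (2 + target_ratio * extra_copies)

def combined_probabilities(observed, hypotheses, ratios, weights=None):
    """Combine, per hypothesis, the likelihoods over the partition of ratios.

    weights: likelihood that each possible ratio is the correct ratio;
        uniform weights are assumed when none are given.
    Returns a dict mapping each hypothesis to its normalized probability.
    """
    if weights is None:
        weights = [1.0 / len(ratios)] * len(ratios)
    log_scores = {}
    for h in hypotheses:
        terms = []
        for r, w in zip(ratios, weights):
            p = predicted_ratio(h, r)
            terms.append(math.log(w) + sum(binom_loglik(k, n, p) for k, n in observed))
        top = max(terms)  # log-sum-exp for numerical stability
        log_scores[h] = top + math.log(sum(math.exp(t - top) for t in terms))
    top = max(log_scores.values())
    norm = sum(math.exp(v - top) for v in log_scores.values())
    return {h: math.exp(v - top) / norm for h, v in log_scores.items()}

# Hypothetical data: five heterozygous loci, partition from 5% to 50%.
observed = [(550, 1000)] * 5
partition = [0.05 * i for i in range(1, 11)]
probabilities = combined_probabilities(observed, [0, 1], partition)
selected = max(probabilities, key=probabilities.get)
```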

In one aspect, the invention features a method for determining a number of copies of a chromosome or chromosome segment in the genome of one or more cells from an individual using phased or unphased genetic data. In some embodiments, the method involves obtaining genetic data at a set of polymorphic loci on the chromosome or chromosome segment in a sample by measuring the quantity of each allele at each locus. In some embodiments, the sample is a sample of DNA or RNA from one or more cells from the individual or a mixed sample of cell-free DNA from the individual that includes cell-free DNA from two or more genetically different cells. In some embodiments, allele ratios are calculated for the loci that are heterozygous in at least one cell from which the sample was derived. In some embodiments, the calculated allele ratio for a particular locus is the measured quantity of one of the alleles divided by the total measured quantity of all the alleles for the locus. In some embodiments, the calculated allele ratio for a particular locus is the measured quantity of one of the alleles (such as the allele on the first homologous chromosome segment) divided by the measured quantity of one or more other alleles (such as the allele on the second homologous chromosome segment) for the locus. In some embodiments, a set of one or more hypotheses specifying the number of copies of the chromosome or chromosome segment in the genome of one or more of the cells are enumerated. In some embodiments, the hypothesis that is most likely based on the test statistic is selected, thereby determining the number of copies of the chromosome or chromosome segment in the genome of one or more of the cells.

In one aspect, the invention features a method for determining a number of copies of a chromosome or chromosome segment in the genome of a fetus (such as a fetus that is gestating in a pregnant mother) using phased or unphased genetic data. In some embodiments, the method involves obtaining genetic data at a set of polymorphic loci on the chromosome or chromosome segment in a sample by measuring the quantity of each allele at each locus. In some embodiments, the sample is a mixed sample of DNA or RNA comprising fetal DNA or RNA and maternal DNA or RNA from the mother of the fetus (such as a mixed sample of cell-free DNA or RNA originating from a blood sample from the mother that includes fetal cell-free DNA or RNA and maternal cell-free DNA or RNA). In some embodiments, allele ratios are calculated for the loci that are heterozygous in the fetus and/or heterozygous in the mother. In some embodiments, the calculated allele ratio for a particular locus is the measured quantity of one of the alleles divided by the total measured quantity of all the alleles for the locus. In some embodiments, the calculated allele ratio for a particular locus is the measured quantity of one of the alleles (such as the allele on the first homologous chromosome segment) divided by the measured quantity of one or more other alleles (such as the allele on the second homologous chromosome segment) for the locus. In some embodiments, a set of one or more hypotheses specifying the number of copies of the chromosome or chromosome segment in the genome of the fetus are enumerated. In some embodiments, the hypothesis that is most likely based on the test statistic is selected, thereby determining the number of copies of the chromosome or chromosome segment in the genome of the fetus.

In some embodiments, a hypothesis is selected if the probability that the test statistic belongs to a distribution of the test statistic for that hypothesis is above an upper threshold; one or more of the hypotheses is rejected if the probability that the test statistic belongs to the distribution of the test statistic for that hypothesis is below a lower threshold; or a hypothesis is neither selected nor rejected if the probability that the test statistic belongs to the distribution of the test statistic for that hypothesis is between the lower threshold and the upper threshold, or if the probability is not determined with sufficiently high confidence. In some embodiments, the overrepresentation of the number of copies of the first homologous chromosome segment is due to a duplication of the first homologous chromosome segment or a deletion of the second homologous chromosome segment. In some embodiments, the total measured quantity of all the alleles for one or more of the loci is compared to a reference amount to determine whether the overrepresentation of the number of copies of the first homologous chromosome segment is due to a duplication of the first homologous chromosome segment or a deletion of the second homologous chromosome segment. In some embodiments, the magnitude of the difference between the calculated allele ratio and the expected allele ratio for one or more loci is used to determine whether the overrepresentation of the number of copies of the first homologous chromosome segment is due to a duplication of the first homologous chromosome segment or a deletion of the second homologous chromosome segment. In some embodiments, the first and second homologous chromosome segments are determined to be present in equal proportions if there is not an overrepresentation of the number of copies of the first homologous chromosome segment, and there is not an overrepresentation of the second homologous chromosome segment (such as in the genome of the cells, cfDNA, cfRNA, individual, fetus, or embryo).
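The three-way decision on a hypothesis and the comparison of total measured quantity to a reference amount can be sketched as two small functions. The threshold values, the tolerance, and the function names are hypothetical placeholders chosen for illustration; the specification does not give numeric values.

```python
def call_hypothesis(probability, lower=0.05, upper=0.95):
    """Select, reject, or make no call on a hypothesis.

    probability: probability that the test statistic belongs to the
        distribution of the test statistic for the hypothesis.
    lower, upper: hypothetical thresholds.
    """
    if probability > upper:
        return "selected"
    if probability < lower:
        return "rejected"
    return "no call"

def duplication_or_deletion(total_quantity, reference_quantity, tolerance=0.05):
    """Attribute an overrepresentation of the first homologous segment.

    A total above the reference amount suggests a duplication of the first
    segment; a total below it suggests a deletion of the second segment.
    tolerance: hypothetical relative margin within which no attribution is made.
    """
    if total_quantity > reference_quantity * (1 + tolerance):
        return "duplication of first segment"
    if total_quantity < reference_quantity * (1 - tolerance):
        return "deletion of second segment"
    return "undetermined"

# Hypothetical usage.
decision = call_hypothesis(0.97)
cause = duplication_or_deletion(1200, 1000)
```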

In some embodiments, the ratio of DNA from the one or more target cellsto the total DNA in the sample is determined based on the total orrelative amount of one or more alleles at one or more loci for which thegenotype of the target cells differs from the genotype of the non-targetcells and for which the target cells and non-target cells are expectedto be disomic. In some embodiments, this ratio is used to determinewhether the overrepresentation of the number of copies of the firsthomologous chromosome segment is due to a duplication of the firsthomologous chromosome segment or a deletion of the second homologouschromosome segment. In some embodiments, the ratio is used to determinethe number of extra copies of a chromosome segment or chromosome that isduplicated. In some embodiments, the phased genetic data includesprobabilistic data. In some embodiments, obtaining the phased geneticdata for the first homologous chromosome segment and/or the secondhomologous chromosome segment in the genome of the fetus includesobtaining phased genetic data for the first homologous chromosomesegment and/or the second homologous chromosome segment in the genome ofone or both biological parents of the fetus, and inferring whichhomologous chromosome segment the fetus inherited from one or bothbiological parents. In some embodiments, the probability of one or morecrossovers (such as 1, 2, 3, or 4 crossovers) that may have occurredduring the formation of a gamete that contributed a copy of the firsthomologous chromosome segment or the second homologous chromosomesegment to the fetus individual is used to infer which homologouschromosome segment(s) the fetus inherited from one or both biologicalparents. 
In some embodiments, phased genetic data for the mother and/or father of the fetus is obtained using a technique selected from the group consisting of digital PCR, inferring a haplotype using population-based haplotype frequencies, haplotyping using a haploid cell such as a sperm or egg, haplotyping using genetic data from one or more first-degree relatives, and combinations thereof. In some embodiments, the phased genetic data for the individual is obtained by phasing a portion or all of a region corresponding to a deletion or duplication in a sample from the individual. In some embodiments, the phased genetic data for a fetus is obtained by phasing a portion or all of a region corresponding to a deletion or duplication in a sample from the fetus or the mother of the fetus. In some embodiments, obtaining phased genetic data for the first and second homologous chromosome segments includes determining the identity of alleles present in one of the chromosome segments and determining the identity of alleles present in the other chromosome segment by inference. In some embodiments, alleles from unphased genetic data that are not present in the first homologous chromosome segment are assigned to the second homologous chromosome segment. For example, if the genotype of the individual is (AB, AB) and the phased data for the individual indicates that the first haplotype is (A, A), then the other haplotype can be inferred to be (B, B). In some embodiments, if only one allele is measured at a locus, then that allele is determined to be part of both the first and second homologous chromosome segments (e.g., if the genotype is AA at a locus, then both haplotypes have the A allele). In some embodiments, the phased genetic data for the individual comprises determining whether or not one or more possible chromosome crossovers occurred, such as by determining the sequence of a recombination hotspot and optionally of a region flanking a recombination hotspot.
In some embodiments, any of the primer libraries of the invention are used to detect a recombination event to determine what haplotype blocks are present in the genome of an individual.
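The inference of the second haplotype from unphased genotypes and a phased first haplotype, as in the (AB, AB) example above, can be sketched as follows. This is a minimal illustration only; the function name `infer_second_haplotype` and the two-letter genotype strings are assumptions made for this sketch and are not part of the described methods.

```python
def infer_second_haplotype(genotypes, first_haplotype):
    """Infer the second haplotype by inference from the first.

    genotypes: sequence of two-character unphased genotypes, e.g. ("AB", "AB").
    first_haplotype: sequence of single alleles phased to the first
        homologous segment, e.g. ("A", "A").
    (Hypothetical data layout chosen for illustration.)
    """
    if len(genotypes) != len(first_haplotype):
        raise ValueError("genotypes and first_haplotype must have equal length")
    second = []
    for genotype, allele in zip(genotypes, first_haplotype):
        if len(genotype) != 2 or allele not in genotype:
            raise ValueError(f"allele {allele!r} is inconsistent with genotype {genotype!r}")
        # Remove one copy of the phased allele; the allele that remains
        # is assigned to the second homologous segment. For a homozygous
        # genotype such as "AA", both haplotypes carry the same allele.
        second.append(genotype.replace(allele, "", 1))
    return tuple(second)


if __name__ == "__main__":
    print(infer_second_haplotype(("AB", "AB"), ("A", "A")))
    print(infer_second_haplotype(("AA", "AB"), ("A", "B")))
```

The sketch treats the phased data as certain; where the phased genetic data is probabilistic, each assignment would instead carry a probability.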

In some embodiments, the method includes using a joint distribution model (such as a joint distribution model that takes into account the linkage between loci), performing a linkage analysis, using a binomial distribution model, using a beta-binomial distribution model, and/or using the likelihood of crossovers having occurred during the meiosis that gave rise to the gametes that formed the embryo that grew into the fetus (such as using the probability of chromosomes crossing over at different locations in a chromosome to model dependence between polymorphic alleles on the chromosome or chromosome segment of interest).
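As one illustration of a beta-binomial distribution model, the sketch below scores observed allele counts against an expected allele fraction under a hypothesis. The parameterization by an expected fraction and a `concentration` value, the function names, and the treatment of loci as independent are all assumptions of this sketch; in particular, independence ignores the linkage between loci that a joint distribution model would capture.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_log_pmf(k, n, alpha, beta):
    """Log probability of k reads of one allele out of n total reads."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def hypothesis_log_likelihood(loci, expected_fraction, concentration):
    """Sum per-locus log likelihoods for one hypothesis.

    loci: iterable of (allele_reads, total_reads) pairs.
    expected_fraction: allele fraction expected under the hypothesis.
    concentration: hypothetical overdispersion parameter; larger values
        approach a plain binomial model.
    """
    alpha = expected_fraction * concentration
    beta = (1.0 - expected_fraction) * concentration
    # Loci are treated as independent here (an assumption of this sketch).
    return sum(beta_binomial_log_pmf(k, n, alpha, beta) for k, n in loci)


if __name__ == "__main__":
    observed = [(520, 1000), (530, 1000), (515, 1000)]  # made-up counts
    for fraction in (0.50, 0.52):
        print(fraction, hypothesis_log_likelihood(observed, fraction, 500.0))
```

Comparing the summed log likelihoods across hypotheses indicates which expected allele fraction better explains the counts.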

In some embodiments, one or more of the calculated allele ratios for the cfDNA or cfRNA are indicative of the corresponding allele ratios for DNA or RNA in the cells from which the cfDNA or cfRNA was derived. In some embodiments, one or more of the calculated allele ratios for the cfDNA or cfRNA are indicative of the corresponding allele ratios in the genome of the individual. In some embodiments, an allele ratio is only calculated or is only compared to an expected allele ratio if the measured genetic data indicate that more than one different allele is present for that locus in the sample (such as in a cfDNA or cfRNA sample). In some embodiments, an allele ratio is only calculated or is only compared to an expected allele ratio if the locus is heterozygous in at least one of the cells from which the sample was derived (such as a locus that is heterozygous in the fetus and/or heterozygous in the mother). In some embodiments, an allele ratio is only calculated or is only compared to an expected allele ratio if the locus is heterozygous in the fetus. In some embodiments, an allele ratio is calculated and compared to an expected allele ratio for a homozygous locus. For example, allele ratios for loci that are predicted to be homozygous for a particular individual being tested (or for both a fetus and pregnant mother) may be analyzed to determine the level of noise or error in the system.
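The two uses of allele ratios described above can be sketched as follows: a ratio is calculated only when more than one allele is observed at a locus, and loci predicted to be homozygous are used to estimate the noise level. The function names, the `min_reads` parameter, and the definition of noise as the fraction of unexpected reads are assumptions made for illustration.

```python
def allele_ratio(count_a, count_b, min_reads=1):
    """Return A / (A + B), or None when fewer than two alleles are observed.

    min_reads is a hypothetical cutoff for counting an allele as present.
    """
    if count_a < min_reads or count_b < min_reads:
        return None
    return count_a / (count_a + count_b)


def homozygous_error_rate(loci):
    """Estimate noise from loci predicted to be homozygous.

    loci: iterable of (expected_allele_reads, other_allele_reads) pairs.
    Returns the pooled fraction of reads that do not match the expected
    allele, or None if there are no reads.
    """
    loci = list(loci)
    total = sum(expected + other for expected, other in loci)
    if total == 0:
        return None
    return sum(other for _, other in loci) / total


if __name__ == "__main__":
    print(allele_ratio(30, 70))   # two alleles observed
    print(allele_ratio(100, 0))   # single allele: no ratio calculated
    print(homozygous_error_rate([(990, 10), (995, 5)]))
```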

In some embodiments, at least 10; 50; 100; 200; 300; 500; 750; 1,000; 2,000; 3,000; 4,000; or more loci (such as SNPs) are analyzed for a chromosome or chromosome segment of interest. In some embodiments, the average number of loci (such as SNPs) per mb in a chromosome or chromosome segment of interest is at least 1; 10; 25; 50; 100; 150; 200; 300; 500; 750; 1,000; or more loci per mb. In some embodiments, the average number of loci (such as SNPs) per mb in a chromosome or chromosome segment of interest is between 1 and 500 loci per mb, such as between 1 and 50, 50 and 100, 100 and 200, 200 and 400, 200 and 300, or 300 and 400 loci per mb, inclusive. In some embodiments, loci in multiple portions of a potential deletion or duplication are analyzed to increase the sensitivity and/or specificity of the CNV determination compared to only analyzing 1 locus or only analyzing a few loci that are near each other. In some embodiments, only the two most common alleles at each locus are measured or are used to determine the calculated allele ratio. In some embodiments, the amplification of loci is performed using a polymerase (e.g., a DNA polymerase, RNA polymerase, or reverse transcriptase) with low 5′→3′ exonuclease and/or low strand displacement activity. In some embodiments, the measured genetic allelic data is obtained by (i) sequencing the DNA or RNA in the sample, (ii) amplifying DNA or RNA in the sample and then sequencing the amplified DNA, or (iii) amplifying the DNA or RNA in the sample, ligating PCR products, and then sequencing the ligated products. In some embodiments, measured genetic allelic data is obtained by dividing the DNA or RNA from the sample into a plurality of fractions, adding a different barcode to the DNA or RNA in each fraction (e.g., such that all the DNA or RNA in a particular fraction has the same barcode), optionally amplifying the barcoded DNA or RNA, combining the fractions, and then sequencing the barcoded DNA or RNA in the combined fractions.
In some embodiments, alleles of the polymorphic loci (such as SNPs) are identified using one or more of the following methods: sequencing (such as nanopore sequencing or Halcyon Molecular sequencing), SNP array, real time PCR, TaqMan, Nanostring nCounter® Analysis System, Illumina GoldenGate Genotyping Assay that uses a discriminatory DNA polymerase and ligase, ligation-mediated PCR, or Linked Inverted Probes (LIPs; which can also be called pre-circularized probes, pre-circularizing probes, circularizing probes, Padlock Probes, or Molecular Inversion Probes (MIPs)). In some embodiments, two or more (such as 3 or 4) target amplicons are ligated together and then the ligated products are sequenced. In some embodiments, measurements for different alleles for the same locus are adjusted for differences in metabolism, apoptosis, histones, inactivation, and/or amplification between the alleles (such as differences in amplification efficiency between different alleles of the same locus). In some embodiments, this adjustment is performed prior to calculating allele ratios for the obtained genetic data or prior to comparing the measured genetic data to the expected genetic data.

In some embodiments, the method also includes determining the presence or absence of one or more risk factors for a disease or disorder. In some embodiments, the method also includes determining the presence or absence of one or more polymorphisms or mutations associated with the disease or disorder or an increased risk for a disease or disorder. In some embodiments, the method also includes determining the total level of cfDNA, cf mDNA, cf nDNA, cfRNA, miRNA, or any combination thereof. In some embodiments, the method includes determining the level of one or more cfDNA, cf mDNA, cf nDNA, cfRNA, and/or miRNA molecules of interest, such as molecules with a polymorphism or mutation associated with a disease or disorder or an increased risk for a disease or disorder. In some embodiments, the fraction of tumor DNA out of total DNA (such as the fraction of tumor cfDNA out of total cfDNA or the fraction of tumor cfDNA with a particular mutation out of total cfDNA) is determined. In some embodiments, this tumor fraction is used to determine the stage of a cancer (since higher tumor fractions can be associated with more advanced stages of cancer). In some embodiments, the method also includes determining the total level of DNA or RNA. In some embodiments, the method includes determining the methylation level of one or more DNA or RNA molecules of interest, such as molecules with a polymorphism or mutation associated with a disease or disorder or an increased risk for a disease or disorder. In some embodiments, the method includes determining the presence or absence of a change in DNA integrity. In some embodiments, the method also includes determining the total level of mRNA splicing. In some embodiments, the method includes determining the level of mRNA splicing or detecting alternative mRNA splicing for one or more RNA molecules of interest, such as molecules with a polymorphism or mutation associated with a disease or disorder or an increased risk for a disease or disorder.

In some embodiments, the invention features a method for detecting a cancer phenotype in an individual, wherein the cancer phenotype is defined by the presence of at least one of a set of mutations. In some embodiments, the method includes obtaining DNA or RNA measurements for a sample of DNA or RNA from one or more cells from the individual, wherein one or more of the cells is suspected of having the cancer phenotype; and analyzing the DNA or RNA measurements to determine, for each of the mutations in the set of mutations, the likelihood that at least one of the cells has that mutation. In some embodiments, the method includes determining that the individual has the cancer phenotype if either (i) for at least one of the mutations, the likelihood that at least one of the cells contains that mutation is greater than a threshold, or (ii) for at least one of the mutations, the likelihood that at least one of the cells has that mutation is less than the threshold, and for a plurality of the mutations, the combined likelihood that at least one of the cells has at least one of the mutations is greater than the threshold. In some embodiments, one or more cells have a subset or all of the mutations in the set of mutations. In some embodiments, the subset of mutations is associated with cancer or an increased risk for cancer. In some embodiments, the sample includes cell-free DNA or RNA. In some embodiments, the DNA or RNA measurements include measurements (such as the quantity of each allele at each locus) at a set of polymorphic loci on one or more chromosomes or chromosome segments of interest.
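The two-part decision rule above can be sketched as follows. The text does not specify how the combined likelihood is computed; this sketch assumes the per-mutation likelihoods are independent probabilities and combines them as one minus the product of their complements. The function names are hypothetical.

```python
from math import prod


def combined_likelihood(likelihoods):
    """Probability that at least one mutation is present.

    Assumes the per-mutation likelihoods are independent probabilities
    (an assumption of this sketch, not stated in the text).
    """
    return 1.0 - prod(1.0 - p for p in likelihoods)


def has_cancer_phenotype(likelihoods, threshold):
    """Apply the two-part rule to per-mutation likelihoods.

    (i) any single mutation's likelihood exceeds the threshold, or
    (ii) no single likelihood does, but the combined likelihood over
         the mutations exceeds the threshold.
    """
    if any(p > threshold for p in likelihoods):
        return True
    return combined_likelihood(likelihoods) > threshold


if __name__ == "__main__":
    print(has_cancer_phenotype([0.95, 0.10], threshold=0.9))
    print(has_cancer_phenotype([0.6, 0.6, 0.6], threshold=0.9))
    print(has_cancer_phenotype([0.2, 0.2], threshold=0.9))
```

Under these assumptions, several individually weak signals can together exceed the threshold, which is the situation addressed by part (ii).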

In one aspect, the invention features methods for selecting a therapy for the treatment, stabilization, or prevention of a disease or disorder in a mammal. In some embodiments, the method includes determining if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment using any of the methods described herein. In some embodiments, a therapy is selected for the mammal (such as a therapy for a disease or disorder associated with the overrepresentation of the first homologous chromosome segment).

In one aspect, the invention features methods for preventing, delaying, stabilizing, or treating a disease or disorder in a mammal. In some embodiments, the method includes determining if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment using any of the methods described herein. In some embodiments, a therapy is selected for the mammal (such as a therapy for a disease or disorder associated with the overrepresentation of the first homologous chromosome segment) and then the therapy is administered to the mammal.

In some embodiments, treating, stabilizing, or preventing a disease or disorder includes preventing or delaying an initial or subsequent occurrence of a disease or disorder, increasing the disease-free survival time between the disappearance of a condition and its reoccurrence, stabilizing or reducing an adverse symptom associated with a condition, or inhibiting or stabilizing the progression of a condition. In some embodiments, at least 20, 40, 60, 80, 90, or 95% of the treated subjects have a complete remission in which all evidence of the condition disappears. In some embodiments, the length of time a subject survives after being diagnosed with a condition and treated is at least 20, 40, 60, 80, 100, 200, or even 500% greater than (i) the average amount of time an untreated subject survives or (ii) the average amount of time a subject treated with another therapy survives.

In some embodiments, treating, stabilizing, or preventing cancer includes reducing or stabilizing the size of a tumor (e.g., a benign or malignant tumor), slowing or preventing an increase in the size of a tumor, reducing or stabilizing the number of tumor cells, increasing the disease-free survival time between the disappearance of a tumor and its reappearance, preventing an initial or subsequent occurrence of a tumor, or reducing or stabilizing an adverse symptom associated with a tumor. In one embodiment, the number of cancerous cells surviving the treatment is at least 10, 20, 40, 60, 80, or 100% lower than the initial number of cancerous cells, as measured using any standard assay. In some embodiments, the decrease in the number of cancerous cells induced by administration of a therapy of the invention is at least 2, 5, 10, 20, or 50-fold greater than the decrease in the number of non-cancerous cells. In some embodiments, the number of cancerous cells present after administration of a therapy is at least 2, 5, 10, 20, or 50-fold lower than the number of cancerous cells present after administration of a control (such as administration of saline or a buffer). In some embodiments, the methods of the present invention result in a decrease of 10, 20, 40, 60, 80, or 100% in the size of a tumor as determined using standard methods. In some embodiments, at least 10, 20, 40, 60, 80, 90, or 95% of the treated subjects have a complete remission in which there are no detectable cancerous cells. In some embodiments, the cancer does not reappear, or reappears after at least 2, 5, 10, 15, or 20 years. In some embodiments, the length of time a subject survives after being diagnosed with cancer and treated with a therapy of the invention is at least 10, 20, 40, 60, 80, 100, 200, or even 500% greater than (i) the average amount of time an untreated subject survives or (ii) the average amount of time a subject treated with another therapy survives.

In one aspect, the invention features methods for stratification of subjects involved in a clinical trial for the treatment, stabilization, or prevention of a disease or disorder in a mammal. In some embodiments, the method includes determining if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment using any of the methods described herein before, during, or after the clinical trial. In some embodiments, the presence or absence of the overrepresentation of the first homologous chromosome segment in the genome of the subject places the subject into a subgroup for the clinical trial.

In some embodiments, the disease or disorder is selected from the group consisting of cancer, mental handicap, learning disability (e.g., idiopathic learning disability), mental retardation, developmental delay, autism, neurodegenerative disease or disorder, schizophrenia, physical handicap, autoimmune disease or disorder, systemic lupus erythematosus, psoriasis, Crohn's disease, glomerulonephritis, HIV infection, AIDS, and combinations thereof. In some embodiments, the disease or disorder is selected from the group consisting of DiGeorge syndrome, DiGeorge 2 syndrome, DiGeorge/VCFS syndrome, Prader-Willi syndrome, Angelman syndrome, Beckwith-Wiedemann syndrome, 1p36 deletion syndrome, 2q37 deletion syndrome, 3q29 deletion syndrome, 9q34 deletion syndrome, 17q21.31 deletion syndrome, Cri-du-chat syndrome, Jacobsen syndrome, Miller-Dieker syndrome, Phelan-McDermid syndrome, Smith-Magenis syndrome, WAGR syndrome, Wolf-Hirschhorn syndrome, Williams syndrome, Williams-Beuren syndrome, Down syndrome, Edward syndrome, Patau syndrome, Klinefelter syndrome, Turner syndrome, 47,XXX syndrome, 47,XYY syndrome, Sotos syndrome, and combinations thereof. In some embodiments, the method determines the presence or absence of one or more of the following chromosomal abnormalities: nullisomy, monosomy, uniparental disomy, trisomy, matched trisomy, unmatched trisomy, maternal trisomy, paternal trisomy, triploidy, mosaicism, tetrasomy, matched tetrasomy, unmatched tetrasomy, other aneuploidies, unbalanced translocations, balanced translocations, insertions, deletions, recombinations, and combinations thereof. In some embodiments, the chromosomal abnormality is any deviation in the copy number of a specific chromosome or chromosome segment from the most common number of copies of that segment or chromosome; for example, in a human somatic cell, any deviation from 2 copies can be regarded as a chromosomal abnormality.
In some embodiments, the method determines the presence or absence of a euploidy. In some embodiments, the copy number hypotheses include one or more copy number hypotheses for a singleton pregnancy. In some embodiments, the copy number hypotheses include one or more copy number hypotheses for a multiple pregnancy, such as a twin pregnancy (e.g., identical or fraternal twins or a vanishing twin). In some embodiments, the copy number hypotheses include all fetuses in a multiple pregnancy being euploid, all fetuses in a multiple pregnancy being aneuploid (such as any of the aneuploidies disclosed herein), and/or one or more fetuses in a multiple pregnancy being euploid and one or more fetuses in a multiple pregnancy being aneuploid. In some embodiments, the copy number hypotheses include identical twins (also referred to as monozygotic twins) or fraternal twins (also referred to as dizygotic twins). In some embodiments, the copy number hypotheses include a molar pregnancy, such as a complete or partial molar pregnancy. In some embodiments, the chromosome segment of interest is an entire chromosome. In some embodiments, the chromosome or chromosome segment is selected from the group consisting of chromosome 13, chromosome 18, chromosome 21, the X chromosome, the Y chromosome, segments thereof, and combinations thereof. In some embodiments, the first homologous chromosome segment and second homologous chromosome segment are a pair of homologous chromosome segments that comprises the chromosome segment of interest. In some embodiments, the first homologous chromosome segment and second homologous chromosome segment are a pair of homologous chromosomes of interest. In some embodiments, a confidence is computed for the CNV determination or the diagnosis of the disease or disorder.

In some embodiments, the deletion is a deletion of at least 0.01 kb, 0.1 kb, 1 kb, 10 kb, 100 kb, 1 mb, 2 mb, 3 mb, 5 mb, 10 mb, 15 mb, 20 mb, 30 mb, or 40 mb. In some embodiments, the deletion is a deletion of between 1 kb and 40 mb, such as between 1 kb and 100 kb, 100 kb and 1 mb, 1 and 5 mb, 5 and 10 mb, 10 and 15 mb, 15 and 20 mb, 20 and 25 mb, 25 and 30 mb, or 30 and 40 mb, inclusive. In some embodiments, one copy of the chromosome segment is deleted and one copy is present. In some embodiments, two copies of the chromosome segment are deleted. In some embodiments, an entire chromosome is deleted.

In some embodiments, the duplication is a duplication of at least 0.01 kb, 0.1 kb, 1 kb, 10 kb, 100 kb, 1 mb, 2 mb, 3 mb, 5 mb, 10 mb, 15 mb, 20 mb, 30 mb, or 40 mb. In some embodiments, the duplication is a duplication of between 1 kb and 40 mb, such as between 1 kb and 100 kb, 100 kb and 1 mb, 1 and 5 mb, 5 and 10 mb, 10 and 15 mb, 15 and 20 mb, 20 and 25 mb, 25 and 30 mb, or 30 and 40 mb, inclusive. In some embodiments, the chromosome segment is duplicated one time. In some embodiments, the chromosome segment is duplicated more than one time, such as 2, 3, 4, or 5 times. In some embodiments, an entire chromosome is duplicated. In some embodiments, a region in a first homologous segment is deleted, and the same region or another region in the second homologous segment is duplicated. In some embodiments, at least 50, 60, 70, 80, 90, 95, 96, 98, 99, or 100% of the SNVs tested for are transversion mutations rather than transition mutations.

In some embodiments, the sample comprises DNA and/or RNA from (i) one or more target cells or (ii) one or more non-target cells. In some embodiments, the sample is a mixed sample with DNA and/or RNA from one or more target cells and one or more non-target cells. In some embodiments, the target cells are cells that have a CNV, such as a deletion or duplication of interest, and the non-target cells are cells that do not have the copy number variation of interest. In some embodiments in which the one or more target cells are cancer cell(s) and the one or more non-target cells are non-cancerous cell(s), the method includes determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment in the genome of one or more of the cancer cells. In some embodiments in which the one or more target cells are genetically identical cancer cell(s) and the one or more non-target cells are non-cancerous cell(s), the method includes determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment in the genome of the cancer cell(s). In some embodiments in which the one or more target cells are genetically non-identical cancer cell(s) and the one or more non-target cells are non-cancerous cell(s), the method includes determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment in the genome of one or more of the genetically non-identical cancer cells. In some embodiments in which the sample comprises cell-free DNA from a mixture of one or more cancer cells and one or more non-cancerous cells, the method includes determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment in the genome of one or more of the cancer cells.
In some embodiments in which the one or more target cells are genetically identical fetal cell(s) and the one or more non-target cells are maternal cell(s), the method includes determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment in the genome of the fetal cell(s). In some embodiments in which the one or more target cells are genetically non-identical fetal cell(s) and the one or more non-target cells are maternal cell(s), the method includes determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment in the genome of one or more of the genetically non-identical fetal cells. As the cells of most individuals contain a nearly identical set of nuclear DNA, the term “target cell” may be used interchangeably with the term “individual” in some embodiments. Cancerous cells have genotypes that are distinct from the host individual. In this case, the cancer itself may be considered an individual. Moreover, many cancers are heterogeneous, meaning that different cells in a tumor are genetically distinct from other cells in the same tumor. In this case, the different genetically identical regions can be considered different individuals. Alternately, the cancer may be considered a single individual with a mixture of cells with distinct genomes. Typically, non-target cells are euploid, though this is not necessarily the case.

In some embodiments, the sample is obtained from a maternal whole blood sample or fraction thereof, cells isolated from a maternal blood sample, an amniocentesis sample, a products of conception sample, a placental tissue sample, a chorionic villus sample, a placental membrane sample, a cervical mucus sample, or a sample from a fetus. In some embodiments, the sample comprises cell-free DNA obtained from a blood sample or fraction thereof from the mother. In some embodiments, the sample comprises nuclear DNA obtained from a mixture of fetal cells and maternal cells. In some embodiments, the sample is obtained from a fraction of maternal blood containing nucleated cells that has been enriched for fetal cells. In some embodiments, a sample is divided into multiple fractions (such as 2, 3, 4, 5, or more fractions) that are each analyzed using a method of the invention. If each fraction produces the same results (such as the presence or absence of one or more CNVs of interest), the confidence in the results increases. If different fractions produce different results, the sample could be re-analyzed or another sample could be collected from the same subject and analyzed.
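The comparison of results across sample fractions can be sketched as follows. The function name, the boolean per-fraction result, and the returned labels are hypothetical choices made for illustration.

```python
def summarize_fraction_results(results):
    """Compare per-fraction calls for one CNV of interest.

    results: list of booleans, one per fraction (True = CNV detected).
    Returns ("concordant", call) when every fraction agrees, or
    ("discordant", None) when the fractions disagree, in which case the
    sample could be re-analyzed or a new sample collected.
    """
    if not results:
        raise ValueError("at least one fraction result is required")
    if all(r == results[0] for r in results):
        return ("concordant", results[0])
    return ("discordant", None)


if __name__ == "__main__":
    print(summarize_fraction_results([True, True, True]))
    print(summarize_fraction_results([True, False, True]))
```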

Exemplary subjects include mammals, such as humans and mammals of veterinary interest. In some embodiments, the mammal is a primate (e.g., a human, a monkey, a gorilla, an ape, a lemur, etc.), a bovine, an equine, a porcine, a canine, or a feline.

In some embodiments, any of the methods include generating a report (such as a written or electronic report) disclosing a result of the method of the invention (such as the presence or absence of a deletion or duplication).

In some embodiments, any of the methods include taking a clinical action based on a result of a method of the invention (such as the presence or absence of a deletion or duplication). In some embodiments in which an embryo or fetus has one or more polymorphisms or mutations of interest (such as a CNV) based on a result of a method of the invention, the clinical action includes performing additional testing (such as testing to confirm the presence of the polymorphism or mutation), not implanting the embryo for IVF, implanting a different embryo for IVF, terminating a pregnancy, preparing for a special needs child, or undergoing an intervention designed to decrease the severity of the phenotypic presentation of a genetic disorder. In some embodiments, the clinical action is selected from the group consisting of performing an ultrasound, amniocentesis on the fetus, amniocentesis on a subsequent fetus that inherits genetic material from the mother and/or father, chorion villus biopsy on the fetus, chorion villus biopsy on a subsequent fetus that inherits genetic material from the mother and/or father, in vitro fertilization, preimplantation genetic diagnosis on one or more embryos that inherited genetic material from the mother and/or father, karyotyping on the mother, karyotyping on the father, fetal echocardiogram (such as an echocardiogram of a fetus with trisomy 21, 18, or 13, monosomy X, or a microdeletion), and combinations thereof.
In some embodiments, the clinical action is selected from the group consisting of administering growth hormone to a born child with monosomy X (such as administration starting at ~9 months), administering calcium to a born child with a 22q deletion (such as DiGeorge syndrome), administering an androgen such as testosterone to a born child with 47,XXY (such as one injection per month for 3 months of 25 mg testosterone enanthate to an infant or toddler), performing a test for cancer on a woman with a complete or partial molar pregnancy (such as a triploid fetus), administering a therapy for cancer such as a chemotherapeutic agent to a woman with a complete or partial molar pregnancy (such as a triploid fetus), screening a fetus determined to be male (such as a fetus determined to be male using a method of the invention) for one or more X-linked genetic disorders such as Duchenne muscular dystrophy (DMD), adrenoleukodystrophy, or hemophilia, performing amniocentesis on a male fetus at risk for an X-linked disorder, administering dexamethasone to a woman with a female fetus (such as a fetus determined to be female using a method of the invention) at risk for congenital adrenal hyperplasia, performing amniocentesis on a female fetus at risk for congenital adrenal hyperplasia, administering killed vaccines (instead of live vaccines) or not administering certain vaccines to a born child that is (or is suspected of being) immune deficient from a 22q11.2 deletion, performing occupational and/or physical therapy, performing early intervention in education, delivering the baby at a tertiary care center with a NICU and/or having pediatric specialists available at delivery, behavioral intervention for a born child (such as a child with XXX, XXY, or XYY), and combinations thereof.

In some embodiments, ultrasound or another screening test is performed on a woman determined to have a multiple pregnancy (such as twins) to determine whether or not two or more of the fetuses are monochorionic. Monozygotic twins result from ovulation and fertilization of a single oocyte, with subsequent division of the zygote; placentation may be dichorionic or monochorionic. Dizygotic twins occur from ovulation and fertilization of two oocytes, which usually results in dichorionic placentation. Monochorionic twins have a risk of twin-to-twin transfusion syndrome, which may cause unequal distribution of blood between fetuses that results in differences in their growth and development, sometimes resulting in stillbirth. Thus, twins determined to be monozygotic twins using a method of the invention are desirably tested (such as by ultrasound) to determine if they are monochorionic twins, and if so, these twins can be monitored (such as bi-weekly ultrasounds from 16 weeks) for signs of twin-to-twin transfusion syndrome.

In some embodiments in which an embryo or fetus does not have one or more polymorphisms or mutations of interest (such as a CNV) based on a result of a method of the invention, the clinical action includes implanting the embryo for IVF or continuing a pregnancy. In some embodiments, the clinical action is additional testing to confirm the absence of the polymorphism or mutation selected from the group consisting of performing an ultrasound, amniocentesis, chorion villus biopsy, and combinations thereof.

In some embodiments in which an individual has one or more polymorphisms or mutations (such as a polymorphism or mutation associated with a disease or disorder such as cancer or an increased risk for a disease or disorder such as cancer) based on a result of a method of the invention, the clinical action includes performing additional testing or administering one or more therapies for a disease or disorder (such as a therapy for cancer, a therapy for the specific type of cancer or type of mutation the individual is diagnosed with, or any of the therapies disclosed herein). In some embodiments, the clinical action is additional testing to confirm the presence or absence of a polymorphism or mutation selected from the group consisting of biopsy, surgery, medical imaging (such as a mammogram or an ultrasound), and combinations thereof.

In some embodiments, the additional testing includes performing the same or a different method (such as any of the methods described herein) to confirm the presence or absence of the polymorphism or mutation (such as a CNV), such as testing either a second fraction of the same sample that was tested or a different sample from the same individual (such as the same pregnant mother, fetus, embryo, or individual at increased risk for cancer). In some embodiments, the additional testing is performed for an individual for whom the probability of a polymorphism or mutation (such as a CNV) is above a threshold value (such as additional testing to confirm the presence of a likely polymorphism or mutation). In some embodiments, the additional testing is performed for an individual for whom the confidence or z-score for the determination of a polymorphism or mutation (such as a CNV) is above a threshold value (such as additional testing to confirm the presence of a likely polymorphism or mutation). In some embodiments, the additional testing is performed for an individual for whom the confidence or z-score for the determination of a polymorphism or mutation (such as a CNV) is between minimum and maximum threshold values (such as additional testing to increase the confidence that the initial result is correct). In some embodiments, the additional testing is performed for an individual for whom the confidence for the determination of the presence or absence of a polymorphism or mutation (such as a CNV) is below a threshold value (such as a “no call” result due to not being able to determine the presence or absence of the CNV with sufficient confidence). An exemplary Z score is calculated in Chiu et al. BMJ 2011; 342:c7401 (which is hereby incorporated by reference in its entirety), in which chromosome 21 is used as an example and can be replaced with any other chromosome or chromosome segment in the test sample.

Z score for percentage chromosome 21 in test case = ((percentage chromosome 21 in test case) − (mean percentage chromosome 21 in reference controls)) / (standard deviation of percentage chromosome 21 in reference controls).
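The Z score expression above can be illustrated with a minimal sketch. The function name `z_score`, the use of the sample standard deviation, and all numeric values are assumptions made for illustration only; they are not taken from Chiu et al. or from the methods of the invention.

```python
from statistics import mean, stdev


def z_score(test_pct, reference_pcts):
    """Z score of a test case against reference controls.

    test_pct: percentage of reads from the chromosome (or segment) in the test case.
    reference_pcts: the same percentage measured in each reference control.
    """
    mu = mean(reference_pcts)
    # Assumption: sample standard deviation (n - 1 denominator).
    sigma = stdev(reference_pcts)
    if sigma == 0:
        raise ValueError("reference controls have zero variance")
    return (test_pct - mu) / sigma


# Hypothetical percentages of chromosome 21 reads, for illustration only.
reference_controls = [1.30, 1.32, 1.31, 1.29, 1.33]
test_case = 1.40
print(round(z_score(test_case, reference_controls), 2))
```

The same function applies to any other chromosome or chromosome segment by supplying the corresponding percentages.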

In some embodiments, the additional testing is performed for an individual for whom the initial sample did not meet quality control guidelines or had a fetal fraction or a tumor fraction below a threshold value. In some embodiments, the method includes selecting an individual for additional testing based on the result of a method of the invention, the probability of the result, the confidence of the result, or the z-score; and performing the additional testing on the individual (such as on the same or a different sample). In some embodiments, a subject diagnosed with a disease or disorder (such as cancer) undergoes repeat testing using a method of the invention or known testing for the disease or disorder at multiple time points to monitor the progression of the disease or disorder or the remission or reoccurrence of the disease or disorder.

In one aspect, the invention features a report (such as a written or electronic report) with a result from a method of the invention (such as the presence or absence of a deletion or duplication).

In various embodiments, the primer extension reaction or the polymerase chain reaction includes the addition of one or more nucleotides by a polymerase. In some embodiments, the primers are in solution. In some embodiments, the primers are in solution and are not immobilized on a solid support. In some embodiments, the primers are not part of a microarray. In various embodiments, the primer extension reaction or the polymerase chain reaction does not include ligation-mediated PCR. In various embodiments, the primer extension reaction or the polymerase chain reaction does not include the joining of two primers by a ligase. In various embodiments, the primers do not include Linked Inverted Probes (LIPs), which can also be called pre-circularized probes, pre-circularizing probes, circularizing probes, Padlock Probes, or Molecular Inversion Probes (MIPs).

It is understood that aspects and embodiments of the invention described herein include combinations of any two or more of the aspects or embodiments of the invention.

Definitions

Single Nucleotide Polymorphism (SNP) refers to a single nucleotide that may differ between the genomes of two members of the same species. The usage of the term should not imply any limit on the frequency with which each variant occurs.

Sequence refers to a DNA sequence or a genetic sequence. It may refer to the primary, physical structure of the DNA molecule or strand in an individual. It may refer to the sequence of nucleotides found in that DNA molecule, or the complementary strand to the DNA molecule. It may refer to the information contained in the DNA molecule as its representation in silico.

Locus refers to a particular region of interest on the DNA of an individual, which may refer to a SNP, the site of a possible insertion or deletion, or the site of some other relevant genetic variation. Disease-linked SNPs may also refer to disease-linked loci.

Polymorphic Allele, also “Polymorphic Locus,” refers to an allele or locus where the genotype varies between individuals within a given species. Some examples of polymorphic alleles include single nucleotide polymorphisms, short tandem repeats, deletions, duplications, and inversions.

Polymorphic Site refers to the specific nucleotides found in a polymorphic region that vary between individuals.

Mutation refers to an alteration in a naturally-occurring or reference nucleic acid sequence, such as an insertion, deletion, duplication, translocation, substitution, frameshift mutation, silent mutation, nonsense mutation, missense mutation, point mutation, transition mutation, transversion mutation, reverse mutation, or microsatellite alteration. In some embodiments, the amino acid sequence encoded by the nucleic acid sequence has at least one amino acid alteration from a naturally-occurring sequence.

Allele refers to the genes that occupy a particular locus.

Genetic Data also “Genotypic Data” refers to the data describing aspects of the genome of one or more individuals. It may refer to one or a set of loci, partial or entire sequences, partial or entire chromosomes, or the entire genome. It may refer to the identity of one or a plurality of nucleotides; it may refer to a set of sequential nucleotides, or nucleotides from different locations in the genome, or a combination thereof. Genotypic data is typically in silico, however, it is also possible to consider physical nucleotides in a sequence as chemically encoded genetic data. Genotypic Data may be said to be “on,” “of,” “at,” “from” or “on” the individual(s). Genotypic Data may refer to output measurements from a genotyping platform where those measurements are made on genetic material.

Genetic Material also “Genetic Sample” refers to physical matter, such as tissue or blood, from one or more individuals comprising DNA or RNA.

Confidence refers to the statistical likelihood that the called SNP, allele, set of alleles, determined number of copies of a chromosome or chromosome segment, or diagnosis of the presence or absence of a disease correctly represents the real genetic state of the individual.

Ploidy Calling, also “Chromosome Copy Number Calling,” or “Copy Number Calling” (CNC), may refer to the act of determining the quantity and/or chromosomal identity of one or more chromosomes or chromosome segments present in a cell.

Aneuploidy refers to the state where the wrong number of chromosomes (e.g., the wrong number of full chromosomes or the wrong number of chromosome segments, such as the presence of deletions or duplications of a chromosome segment) is present in a cell. In the case of a somatic human cell it may refer to the case where a cell does not contain 22 pairs of autosomal chromosomes and one pair of sex chromosomes. In the case of a human gamete, it may refer to the case where a cell does not contain one of each of the 23 chromosomes. In the case of a single chromosome type, it may refer to the case where more or less than two homologous but non-identical chromosome copies are present, or where there are two chromosome copies present that originate from the same parent. In some embodiments, the deletion of a chromosome segment is a microdeletion.

Ploidy State refers to the quantity and/or chromosomal identity of one or more chromosomes or chromosome segments in a cell.

Chromosome may refer to a single chromosome copy, meaning a single molecule of DNA of which there are 46 in a normal somatic cell; an example is ‘the maternally derived chromosome 18’. Chromosome may also refer to a chromosome type, of which there are 23 in a normal human somatic cell; an example is ‘chromosome 18’.

Chromosomal Identity may refer to the referent chromosome number, i.e. the chromosome type. Normal humans have 22 types of numbered autosomal chromosome types, and two types of sex chromosomes. It may also refer to the parental origin of the chromosome. It may also refer to a specific chromosome inherited from the parent. It may also refer to other identifying features of a chromosome.

Allelic Data refers to a set of genotypic data concerning a set of one or more alleles. It may refer to the phased, haplotypic data. It may refer to SNP identities, and it may refer to the sequence data of the DNA, including insertions, deletions, repeats and mutations. It may include the parental origin of each allele.

Allelic State refers to the actual state of the genes in a set of one or more alleles. It may refer to the actual state of the genes described by the allelic data.

Allele Count refers to the number of sequences that map to a particular locus, and if that locus is polymorphic, it refers to the number of sequences that map to each of the alleles. If each allele is counted in a binary fashion, then the allele count will be a whole number. If the alleles are counted probabilistically, then the allele count can be a fractional number.

Allele Count Probability refers to the number of sequences that are likely to map to a particular locus or a set of alleles at a polymorphic locus, combined with the probability of the mapping. Note that allele counts are equivalent to allele count probabilities where the probability of the mapping for each counted sequence is binary (zero or one). In some embodiments, the allele count probabilities may be binary. In some embodiments, the allele count probabilities may be set to be equal to the DNA measurements.
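The difference between binary and probabilistic counting can be sketched as follows. This is a minimal illustration only: the function name `allele_counts`, the (allele, mapping probability) input format, and the 0.5 cutoff used to reduce a probability to zero or one are all assumptions, not details specified in this disclosure.

```python
from collections import defaultdict


def allele_counts(reads, binary=True, threshold=0.5):
    """Count reads per allele at one polymorphic locus.

    reads: iterable of (allele, mapping_probability) pairs.
    binary=True  -> each read contributes exactly zero or one (whole-number counts).
    binary=False -> each read contributes its mapping probability (fractional counts).
    """
    counts = defaultdict(float)
    for allele, probability in reads:
        if binary:
            # Assumption: a read is counted when its probability meets the cutoff.
            counts[allele] += 1.0 if probability >= threshold else 0.0
        else:
            counts[allele] += probability
    return dict(counts)


# Hypothetical reads at a single locus with alleles "A" and "B".
reads = [("A", 1.0), ("A", 0.9), ("B", 0.6), ("B", 0.2)]
print(allele_counts(reads, binary=True))
print(allele_counts(reads, binary=False))
```

With binary counting the totals are whole numbers; with probabilistic counting the same reads yield fractional totals.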

Allelic Distribution, or “allele count distribution” refers to the relative amount of each allele that is present for each locus in a set of loci. An allelic distribution can refer to an individual, to a sample, or to a set of measurements made on a sample. In the context of digital allele measurements such as sequencing, the allelic distribution refers to the number or probable number of reads that map to a particular allele for each allele in a set of polymorphic loci. In the context of analog allele measurements such as SNP arrays, the allelic distribution refers to allele intensities and/or allele ratios. The allele measurements may be treated probabilistically, that is, the likelihood that a given allele is present for a given sequence read is a fraction between 0 and 1, or they may be treated in a binary fashion, that is, any given read is considered to be exactly zero or one copies of a particular allele.

Allelic Distribution Pattern refers to a set of different allele distributions for different contexts, such as different parental contexts. Certain allelic distribution patterns may be indicative of certain ploidy states.

Allelic Bias refers to the degree to which the measured ratio of alleles at a heterozygous locus is different from the ratio that was present in the original sample of DNA or RNA. The degree of allelic bias at a particular locus is equal to the observed allelic ratio at that locus, as measured, divided by the ratio of alleles in the original DNA or RNA sample at that locus. Allelic bias may be due to amplification bias, purification bias, or some other phenomenon that affects different alleles differently.
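As a minimal sketch of the definition above, the degree of allelic bias can be computed as the observed allelic ratio divided by the original allelic ratio. The function names and the read counts are hypothetical, and expressing each ratio as allele A over allele B is an assumption made for illustration.

```python
def allelic_ratio(count_a, count_b):
    """Ratio of allele A to allele B at one heterozygous locus."""
    if count_b == 0:
        raise ValueError("allele B count is zero; ratio is undefined")
    return count_a / count_b


def allelic_bias(observed_a, observed_b, original_a, original_b):
    """Observed allelic ratio divided by the ratio in the original sample."""
    return allelic_ratio(observed_a, observed_b) / allelic_ratio(original_a, original_b)


# Hypothetical example: a 1:1 heterozygous locus measured as 600 A reads and 400 B reads.
bias = allelic_bias(observed_a=600, observed_b=400, original_a=1, original_b=1)
print(bias)
```

A value of 1 indicates no bias; values above or below 1 indicate that one allele was favored during measurement.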

Allelic imbalance refers to the proportion of abnormal DNA. For SNVs, the proportion of abnormal DNA is typically measured using mutant allele frequency (number of mutant alleles at a locus/total number of alleles at that locus). Since the difference between the amounts of two homologs in tumours is analogous, we measure the proportion of abnormal DNA for a CNV by the average allelic imbalance (AAI), defined as |(H1−H2)|/(H1+H2), where Hi is the average number of copies of homolog i in the sample and Hi/(H1+H2) is the fractional abundance, or homolog ratio, of homolog i. The maximum homolog ratio is the homolog ratio of the more abundant homolog.
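The quantities in this definition can be sketched in a few lines. The function names and the example copy numbers are hypothetical and serve only to illustrate the arithmetic; the identity AAI = 2 × (maximum homolog ratio) − 1 follows directly from the definitions.

```python
def mutant_allele_frequency(mutant_alleles, total_alleles):
    """Number of mutant alleles at a locus divided by total alleles at that locus."""
    return mutant_alleles / total_alleles


def average_allelic_imbalance(h1, h2):
    """AAI = |H1 - H2| / (H1 + H2), with Hi the average copies of homolog i."""
    return abs(h1 - h2) / (h1 + h2)


def homolog_ratios(h1, h2):
    """Fractional abundance Hi / (H1 + H2) of each homolog."""
    total = h1 + h2
    return h1 / total, h2 / total


def maximum_homolog_ratio(h1, h2):
    """Homolog ratio of the more abundant homolog."""
    return max(homolog_ratios(h1, h2))


# Hypothetical sample in which homolog 1 averages 1.1 copies and homolog 2 averages 1.0.
h1, h2 = 1.1, 1.0
aai = average_allelic_imbalance(h1, h2)
max_ratio = maximum_homolog_ratio(h1, h2)
print(round(aai, 4), round(max_ratio, 4))
```

Equal homolog amounts give an AAI of 0 and a maximum homolog ratio of 0.5.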

Assay drop-out rate is the percentage of SNPs with no reads, estimated using all SNPs.

Single allele drop-out (ADO) rate is the percentage of SNPs with only one allele present, estimated using only heterozygous SNPs.
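Both drop-out rates can be sketched from per-SNP read counts. The data layout (a mapping from SNP identifier to a pair of read counts for the two alleles), the function names, and the choice to keep heterozygous SNPs with no reads in the ADO denominator are assumptions made for illustration.

```python
def assay_dropout_rate(read_counts):
    """Percentage of SNPs with no reads, estimated using all SNPs.

    read_counts: dict mapping SNP id -> (reads_allele_1, reads_allele_2).
    """
    if not read_counts:
        raise ValueError("no SNPs supplied")
    no_reads = sum(1 for a1, a2 in read_counts.values() if a1 + a2 == 0)
    return 100.0 * no_reads / len(read_counts)


def single_allele_dropout_rate(read_counts, heterozygous_snps):
    """Percentage of heterozygous SNPs at which exactly one allele was observed."""
    het = [read_counts[snp] for snp in heterozygous_snps]
    if not het:
        raise ValueError("no heterozygous SNPs supplied")
    # Exactly one of the two alleles has reads.
    one_allele = sum(1 for a1, a2 in het if (a1 == 0) != (a2 == 0))
    return 100.0 * one_allele / len(het)


# Hypothetical read counts; s1, s3, and s4 are assumed known to be heterozygous.
counts = {"s1": (10, 12), "s2": (0, 0), "s3": (8, 0), "s4": (5, 7), "s5": (20, 0)}
print(assay_dropout_rate(counts))
print(round(single_allele_dropout_rate(counts, ["s1", "s3", "s4"]), 2))
```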

Primer, also “PCR probe” refers to a single nucleic acid molecule (such as a DNA molecule or a DNA oligomer) or a collection of nucleic acid molecules (such as DNA molecules or DNA oligomers) where the molecules are identical, or nearly so, and wherein the primer contains a region that is designed to hybridize to a targeted locus (e.g., a targeted polymorphic locus or a non-polymorphic locus) or to a universal priming sequence, and may contain a priming sequence designed to allow PCR amplification. A primer may also contain a molecular barcode. A primer may contain a random region that differs for each individual molecule.

Library of primers refers to a population of two or more primers. In various embodiments, the library includes at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different primers. In various embodiments, the library includes at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different primer pairs, wherein each pair of primers includes a forward test primer and a reverse test primer where each pair of test primers hybridize to a target locus. In some embodiments, the library of primers includes at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different individual primers that each hybridize to a different target locus, wherein the individual primers are not part of primer pairs. In some embodiments, the library has both (i) primer pairs and (ii) individual primers (such as universal primers) that are not part of primer pairs.

Different primers refers to non-identical primers.

Different pools refers to non-identical pools.

Different target loci refers to non-identical target loci.

Different amplicons refers to non-identical amplicons.

Hybrid Capture Probe refers to any nucleic acid sequence, possibly modified, that is generated by various methods such as PCR or direct synthesis and intended to be complementary to one strand of a specific target DNA sequence in a sample. The exogenous hybrid capture probes may be added to a prepared sample and hybridized through a denature-reannealing process to form duplexes of exogenous-endogenous fragments. These duplexes may then be physically separated from the sample by various means.

Sequence Read refers to data representing a sequence of nucleotide bases that were measured, e.g., using a clonal sequencing method. Clonal sequencing may produce sequence data representing single, or clones, or clusters of one original DNA molecule. A sequence read may also have an associated quality score at each base position of the sequence indicating the probability that the nucleotide has been called correctly.

Mapping a sequence read is the process of determining a sequence read's location of origin in the genome sequence of a particular organism. The location of origin of sequence reads is based on similarity of nucleotide sequence of the read and the genome sequence.

Matched Copy Error, also “Matching Chromosome Aneuploidy” (MCA), refers to a state of aneuploidy where one cell contains two identical or nearly identical chromosomes. This type of aneuploidy may arise during the formation of the gametes in meiosis, and may be referred to as a meiotic non-disjunction error. This type of error may arise in mitosis. Matching trisomy may refer to the case where three copies of a given chromosome are present in an individual and two of the copies are identical.

Unmatched Copy Error, also “Unique Chromosome Aneuploidy” (UCA), refers to a state of aneuploidy where one cell contains two chromosomes that are from the same parent, and that may be homologous but not identical. This type of aneuploidy may arise during meiosis, and may be referred to as a meiotic error. Unmatching trisomy may refer to the case where three copies of a given chromosome are present in an individual and two of the copies are from the same parent, and are homologous, but are not identical. Note that unmatching trisomy may refer to the case where two homologous chromosomes from one parent are present, and where some segments of the chromosomes are identical while other segments are merely homologous.

Homologous Chromosomes refers to chromosome copies that contain the same set of genes that normally pair up during meiosis.

Identical Chromosomes refers to chromosome copies that contain the same set of genes, and for each gene they have the same set of alleles that are identical, or nearly identical.

Allele Drop Out (ADO) refers to the situation where at least one of the base pairs in a set of base pairs from homologous chromosomes at a given allele is not detected.

Locus Drop Out (LDO) refers to the situation where both base pairs in a set of base pairs from homologous chromosomes at a given allele are not detected.

Homozygous refers to having similar alleles at corresponding chromosomal loci.

Heterozygous refers to having dissimilar alleles at corresponding chromosomal loci.

Heterozygosity Rate refers to the rate of individuals in the population having heterozygous alleles at a given locus. The heterozygosity rate may also refer to the expected or measured ratio of alleles, at a given locus in an individual, or a sample of DNA or RNA.

Chromosomal Region refers to a segment of a chromosome, or a full chromosome.

Segment of a Chromosome refers to a section of a chromosome that can range in size from one base pair to the entire chromosome.

Chromosome refers to either a full chromosome, or a segment or section of a chromosome.

Copies refers to the number of copies of a chromosome segment. It may refer to identical copies, or to non-identical, homologous copies of a chromosome segment wherein the different copies of the chromosome segment contain a substantially similar set of loci, and where one or more of the alleles are different. Note that in some cases of aneuploidy, such as the M2 copy error, it is possible to have some copies of the given chromosome segment that are identical as well as some copies of the same chromosome segment that are not identical.

Haplotype refers to a combination of alleles at multiple loci that are typically inherited together on the same chromosome. Haplotype may refer to as few as two loci or to an entire chromosome depending on the number of recombination events that have occurred between a given set of loci. Haplotype can also refer to a set of SNPs on a single chromatid that are statistically associated.

Haplotypic Data, also “Phased Data” or “Ordered Genetic Data,” refers to data from a single chromosome or chromosome segment in a diploid or polyploid genome, e.g., either the segregated maternal or paternal copy of a chromosome in a diploid genome.

Phasing refers to the act of determining the haplotypic genetic data of an individual given unordered, diploid (or polyploid) genetic data. It may refer to the act of determining which of two genes at an allele, for a set of alleles found on one chromosome, are associated with each of the two homologous chromosomes in an individual.

Phased Data refers to genetic data where one or more haplotypes have been determined.

Hypothesis refers to a possible state, such as a possible degree of overrepresentation of the number of copies of a first homologous chromosome or chromosome segment as compared to a second homologous chromosome or chromosome segment, a possible deletion, a possible duplication, a possible ploidy state at a given set of one or more chromosomes or chromosome segments, a possible allelic state at a given set of one or more loci, a possible paternity relationship, or a possible DNA, RNA, fetal fraction at a given set of one or more chromosomes or chromosome segment, or a set of quantities of genetic material from a set of loci. The genetic states can optionally be linked with probabilities indicating the relative likelihood of each of the elements in the hypothesis being true in relation to other elements in the hypothesis, or the relative likelihood of the hypothesis as a whole being true. The set of possibilities may comprise one or more elements.

Copy Number Hypothesis, also “Ploidy State Hypothesis,” refers to a hypothesis concerning the number of copies of a chromosome or chromosome segment in an individual. It may also refer to a hypothesis concerning the identity of each of the chromosomes, including the parent of origin of each chromosome, and which of the parent's two chromosomes are present in the individual. It may also refer to a hypothesis concerning which chromosomes, or chromosome segments, if any, from a related individual correspond genetically to a given chromosome from an individual.

Related Individual refers to any individual who is genetically related to, and thus shares haplotype blocks with, the target individual. In one context, the related individual may be a genetic parent of the target individual, or any genetic material derived from a parent, such as a sperm, a polar body, an embryo, a fetus, or a child. It may also refer to a sibling, parent, or grandparent.

Sibling refers to any individual whose genetic parents are the same as the individual in question. In some embodiments, it may refer to a born child, an embryo, or a fetus, or one or more cells originating from a born child, an embryo, or a fetus. A sibling may also refer to a haploid individual that originates from one of the parents, such as a sperm, a polar body, or any other set of haplotypic genetic matter. An individual may be considered to be a sibling of itself.

Child may refer to an embryo, a blastomere, or a fetus. Note that in the presently disclosed embodiments, the concepts described apply equally well to individuals who are a born child, a fetus, an embryo, or a set of cells therefrom. The use of the term child may simply be meant to connote that the individual referred to as the child is the genetic offspring of the parents.

Fetal refers to “of the fetus,” or “of the region of the placenta that is genetically similar to the fetus”. In a pregnant woman, some portion of the placenta is genetically similar to the fetus, and the free floating fetal DNA found in maternal blood may have originated from the portion of the placenta with a genotype that matches the fetus. Note that the genetic information in half of the chromosomes in a fetus is inherited from the mother of the fetus. In some embodiments, the DNA from these maternally inherited chromosomes that came from a fetal cell is considered to be “of fetal origin,” not “of maternal origin.”

DNA of Fetal Origin refers to DNA that was originally part of a cell whose genotype was essentially equivalent to that of the fetus.

DNA of Maternal Origin refers to DNA that was originally part of a cell whose genotype was essentially equivalent to that of the mother.

Parent refers to the genetic mother or father of an individual. An individual typically has two parents, a mother and a father, though this may not necessarily be the case such as in genetic or chromosomal chimerism. A parent may be considered to be an individual.

Parental Context refers to the genetic state of a given SNP, on each of the two relevant chromosomes for one or both of the two parents of the target.

Maternal Plasma refers to the plasma portion of the blood from a female who is pregnant.

Clinical Decision refers to any decision to take or not take an action that has an outcome that affects the health or survival of an individual. A clinical decision may also refer to a decision to conduct further testing, to abort or maintain a pregnancy, to take actions to mitigate an undesirable phenotype, or to take actions to prepare for a phenotype.

Diagnostic Box refers to one or a combination of machines designed to perform one or a plurality of aspects of the methods disclosed herein. In an embodiment, the diagnostic box may be placed at a point of patient care. In an embodiment, the diagnostic box may perform targeted amplification followed by sequencing. In an embodiment the diagnostic box may function alone or with the help of a technician.

Informatics Based Method refers to a method that relies heavily on statistics to make sense of a large amount of data. In the context of prenatal diagnosis, it refers to a method designed to determine the ploidy state at one or more chromosomes or chromosome segments, the allelic state at one or more alleles, or paternity by statistically inferring the most likely state, rather than by directly physically measuring the state, given a large amount of genetic data, for example from a molecular array or sequencing. In an embodiment of the present disclosure, the informatics based technique may be one disclosed in this patent application. In an embodiment of the present disclosure it may be PARENTAL SUPPORT.

Primary Genetic Data refers to the analog intensity signals that are output by a genotyping platform. In the context of SNP arrays, primary genetic data refers to the intensity signals before any genotype calling has been done. In the context of sequencing, primary genetic data refers to the analog measurements, analogous to the chromatogram, that come off the sequencer before the identity of any base pairs has been determined, and before the sequence has been mapped to the genome.

Secondary Genetic Data refers to processed genetic data that are output by a genotyping platform. In the context of a SNP array, the secondary genetic data refers to the allele calls made by software associated with the SNP array reader, wherein the software has made a call whether a given allele is present or not present in the sample. In the context of sequencing, the secondary genetic data refers to the base pair identities of the sequences that have been determined, and possibly also where the sequences have been mapped to the genome.

Preferential Enrichment of DNA that corresponds to a locus, or preferential enrichment of DNA at a locus, refers to any method that results in the percentage of molecules of DNA in a post-enrichment DNA mixture that correspond to the locus being higher than the percentage of molecules of DNA in the pre-enrichment DNA mixture that correspond to the locus. The method may involve selective amplification of DNA molecules that correspond to a locus. The method may involve removing DNA molecules that do not correspond to the locus. The method may involve a combination of methods. The degree of enrichment is defined as the percentage of molecules of DNA in the post-enrichment mixture that correspond to the locus divided by the percentage of molecules of DNA in the pre-enrichment mixture that correspond to the locus. Preferential enrichment may be carried out at a plurality of loci. In some embodiments of the present disclosure, the degree of enrichment is greater than 20, 200, or 2,000. When preferential enrichment is carried out at a plurality of loci, the degree of enrichment may refer to the average degree of enrichment of all of the loci in the set of loci.
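The degree of enrichment defined above can be sketched as a ratio of two percentages. The function names and the molecule counts are hypothetical, and the use of an arithmetic mean for the multi-locus average is an assumption made for illustration.

```python
from statistics import mean


def percent_on_locus(molecules_on_locus, total_molecules):
    """Percentage of DNA molecules in a mixture that correspond to the locus."""
    return 100.0 * molecules_on_locus / total_molecules


def degree_of_enrichment(pre_on, pre_total, post_on, post_total):
    """Post-enrichment percentage divided by pre-enrichment percentage."""
    return percent_on_locus(post_on, post_total) / percent_on_locus(pre_on, pre_total)


def average_degree_of_enrichment(per_locus_counts):
    """Average degree of enrichment over a set of loci.

    per_locus_counts: iterable of (pre_on, pre_total, post_on, post_total) tuples.
    """
    return mean(degree_of_enrichment(*counts) for counts in per_locus_counts)


# Hypothetical counts: 10 of 1,000,000 molecules before, 5,000 of 1,000,000 after.
print(round(degree_of_enrichment(10, 1_000_000, 5_000, 1_000_000)))
```

A result greater than 1 means the locus is more highly represented after enrichment than before.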

Amplification refers to a method that increases the number of copies of a molecule of DNA or RNA.

Selective Amplification may refer to a method that increases the number of copies of a particular molecule of DNA (or RNA), or molecules of DNA (or RNA) that correspond to a particular region of DNA (or RNA). It may also refer to a method that increases the number of copies of a particular targeted molecule of DNA (or RNA), or targeted region of DNA (or RNA) more than it increases non-targeted molecules or regions of DNA (or RNA). Selective amplification may be a method of preferential enrichment.

Universal Priming Sequence refers to a DNA (or RNA) sequence that may be appended to a population of target DNA (or RNA) molecules, for example by ligation, PCR, or ligation mediated PCR. Once added to the population of target molecules, primers specific to the universal priming sequences can be used to amplify the target population using a single pair of amplification primers. Universal priming sequences are typically not related to the target sequences.

Universal Adapters, or “ligation adaptors” or “library tags” are nucleic acid molecules containing a universal priming sequence that can be covalently linked to the 5-prime and 3-prime end of a population of target double stranded nucleic acid molecules. The addition of the adapters provides universal priming sequences to the 5-prime and 3-prime end of the target population from which PCR amplification can take place, amplifying all molecules from the target population, using a single pair of amplification primers.

Targeting refers to a method used to selectively amplify or otherwise preferentially enrich those molecules of DNA (or RNA) that correspond to a set of loci in a mixture of DNA (or RNA).

Joint Distribution Model refers to a model that defines the probability of events defined in terms of multiple random variables, given a plurality of random variables defined on the same probability space, where the probabilities of the variable are linked. In some embodiments, the degenerate case where the probabilities of the variables are not linked may be used.

Cancer-related gene refers to a gene associated with an altered risk for a cancer or an altered prognosis for a cancer. Exemplary cancer-related genes that promote cancer include oncogenes; genes that enhance cell proliferation, invasion, or metastasis; genes that inhibit apoptosis; and pro-angiogenesis genes. Cancer-related genes that inhibit cancer include, but are not limited to, tumor suppressor genes; genes that inhibit cell proliferation, invasion, or metastasis; genes that promote apoptosis; and anti-angiogenesis genes.

Estrogen-related cancer refers to a cancer that is modulated by estrogen. Examples of estrogen-related cancers include, without limitation, breast cancer and ovarian cancer. Her2 is overexpressed in many estrogen-related cancers (U.S. Pat. No. 6,165,464, which is hereby incorporated by reference in its entirety).

Androgen-related cancer refers to a cancer that is modulated by androgen. An example of androgen-related cancers is prostate cancer.

Higher than normal expression level refers to expression of an mRNA or protein at a level that is higher than the average expression level of the corresponding molecule in control subjects (such as subjects without a disease or disorder such as cancer). In various embodiments, the expression level is at least 20, 40, 50, 75, 90, 100, 200, 500, or even 1000% higher than the level in control subjects.

Lower than normal expression level refers to expression of an mRNA or protein at a level that is lower than the average expression level of the corresponding molecule in control subjects (such as subjects without a disease or disorder such as cancer). In various embodiments, the expression level is at least 20, 40, 50, 75, 90, 95, or 100% lower than the level in control subjects. In some embodiments, the expression of the mRNA or protein is not detectable.

Modulate expression or activity refers to either increasing or decreasing expression or activity, for example, of a protein or nucleic acid sequence, relative to control conditions. In some embodiments, the modulation in expression or activity is an increase or decrease of at least 10, 20, 40, 50, 75, 90, 100, 200, 500, or even 1000%. In various embodiments, transcription, translation, mRNA or protein stability, or the binding of the mRNA or protein to other molecules in vivo is modulated by the therapy. In some embodiments, the level of mRNA is determined by standard Northern blot analysis, and the level of protein is determined by standard Western blot analysis, such as the analyses described herein or those described by, for example, Ausubel et al. (Current Protocols in Molecular Biology, John Wiley & Sons, New York, Jul. 11, 2013, which is hereby incorporated by reference in its entirety). In one embodiment, the level of a protein is determined by measuring the level of enzymatic activity, using standard methods. In another preferred embodiment, the level of mRNA, protein, or enzymatic activity is equal to or less than 20, 10, 5, or 2-fold above the corresponding level in control cells that do not express a functional form of the protein, such as cells homozygous for a nonsense mutation. In yet another embodiment, the level of mRNA, protein, or enzymatic activity is equal to or less than 20, 10, 5, or 2-fold above the corresponding basal level in control cells, such as non-cancerous cells, cells that have not been exposed to conditions that induce abnormal cell proliferation or that inhibit apoptosis, or cells from a subject without the disease or disorder of interest.

Dosage sufficient to modulate mRNA or protein expression or activity refers to an amount of a therapy that increases or decreases mRNA or protein expression or activity when administered to a subject. In some embodiments, for a compound that decreases expression or activity, the modulation is a decrease in expression or activity that is at least 10%, 30%, 40%, 50%, 75%, or 90% lower in a treated subject than in the same subject prior to the administration of the inhibitor or than in an untreated, control subject. In addition, in some embodiments, for a compound that increases expression or activity, the amount of expression or activity of the mRNA or protein is at least 1.5-, 2-, 3-, 5-, 10-, or 20-fold greater in a treated subject than in the same subject prior to the administration of the modulator or than in an untreated, control subject.

In some embodiments, compounds may directly or indirectly modulate the expression or activity of the mRNA or protein. For example, a compound may indirectly modulate the expression or activity of an mRNA or protein of interest by modulating the expression or activity of a molecule (e.g., a nucleic acid, protein, signaling molecule, growth factor, cytokine, or chemokine) that directly or indirectly affects the expression or activity of the mRNA or protein of interest. In some embodiments, the compounds inhibit cell division or induce apoptosis. These compounds in the therapy may include, for example, unpurified or purified proteins, antibodies, synthetic organic molecules, naturally-occurring organic molecules, nucleic acid molecules, and components thereof. The compounds in a combination therapy may be administered simultaneously or sequentially. Exemplary compounds include signal transduction inhibitors.

Purified refers to being separated from other components that naturally accompany it. Typically, a factor is substantially pure when it is at least 50%, by weight, free from proteins, antibodies, and naturally-occurring organic molecules with which it is naturally associated. In some embodiments, the factor is at least 75%, 90%, or 99%, by weight, pure. A substantially pure factor may be obtained by chemical synthesis, separation of the factor from natural sources, or production of the factor in a recombinant host cell that does not naturally produce the factor. Proteins and small molecules may be purified by one skilled in the art using standard techniques such as those described by Ausubel et al. (Current Protocols in Molecular Biology, John Wiley & Sons, New York, Jul. 11, 2013, which is hereby incorporated by reference in its entirety). In some embodiments the factor is at least 2, 5, or 10 times as pure as the starting material, as measured using polyacrylamide gel electrophoresis, column chromatography, optical density, HPLC analysis, or western analysis (Ausubel et al., supra). Exemplary methods of purification include immunoprecipitation, column chromatography such as immunoaffinity chromatography, magnetic bead immunoaffinity purification, and panning with a plate-bound antibody.

Other features and advantages of the invention will be apparent from the following detailed description and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

The presently disclosed embodiments will be further explained with reference to the attached drawings, wherein like structures are referred to by like numerals throughout the several views. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the presently disclosed embodiments.

FIGS. 1A-1D are graphs showing the distribution of the test statistic S divided by T (the number of SNPs) (“S/T”) for various copy number hypotheses for a depth of read (DOR) of 500 and a tumor fraction of 1% for an increasing number of SNPs. FIG. 1A: 100 SNPs, FIG. 1B: 333 SNPs, FIG. 1C: 667 SNPs, FIG. 1D: 1000 SNPs.

FIGS. 2A-2D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 500 and tumor fraction of 2% for an increasing number of SNPs. FIG. 2A: 100 SNPs, FIG. 2B: 333 SNPs, FIG. 2C: 667 SNPs, FIG. 2D: 1000 SNPs.

FIGS. 3A-3D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 500 and tumor fraction of 3% for an increasing number of SNPs. FIG. 3A: 100 SNPs, FIG. 3B: 333 SNPs, FIG. 3C: 667 SNPs, FIG. 3D: 1000 SNPs.

FIGS. 4A-4D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 500 and tumor fraction of 4% for an increasing number of SNPs. FIG. 4A: 100 SNPs, FIG. 4B: 333 SNPs, FIG. 4C: 667 SNPs, FIG. 4D: 1000 SNPs.

FIGS. 5A-5D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 500 and tumor fraction of 5% for an increasing number of SNPs. FIG. 5A: 100 SNPs, FIG. 5B: 333 SNPs, FIG. 5C: 667 SNPs, FIG. 5D: 1000 SNPs.

FIGS. 6A-6D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 500 and tumor fraction of 6% for an increasing number of SNPs. FIG. 6A: 100 SNPs, FIG. 6B: 333 SNPs, FIG. 6C: 667 SNPs, FIG. 6D: 1000 SNPs.

FIGS. 7A-7D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 1000 and tumor fraction of 0.5% for an increasing number of SNPs. FIG. 7A: 100 SNPs, FIG. 7B: 333 SNPs, FIG. 7C: 667 SNPs, FIG. 7D: 1000 SNPs.

FIGS. 8A-8D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 1000 and tumor fraction of 1% for an increasing number of SNPs. FIG. 8A: 100 SNPs, FIG. 8B: 333 SNPs, FIG. 8C: 667 SNPs, FIG. 8D: 1000 SNPs.

FIGS. 9A-9D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 1000 and tumor fraction of 2% for an increasing number of SNPs. FIG. 9A: 100 SNPs, FIG. 9B: 333 SNPs, FIG. 9C: 667 SNPs, FIG. 9D: 1000 SNPs.

FIGS. 10A-10D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 1000 and tumor fraction of 3% for an increasing number of SNPs. FIG. 10A: 100 SNPs, FIG. 10B: 333 SNPs, FIG. 10C: 667 SNPs, FIG. 10D: 1000 SNPs.

FIGS. 11A-11D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 1000 and tumor fraction of 4% for an increasing number of SNPs. FIG. 11A: 100 SNPs, FIG. 11B: 333 SNPs, FIG. 11C: 667 SNPs, FIG. 11D: 1000 SNPs.

FIGS. 12A-12D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 3000 and tumor fraction of 0.5% for an increasing number of SNPs. FIG. 12A: 100 SNPs, FIG. 12B: 333 SNPs, FIG. 12C: 667 SNPs, FIG. 12D: 1000 SNPs.

FIGS. 13A-13D are graphs showing the distribution of S/T for various copy number hypotheses for a DOR of 3000 and tumor fraction of 1% for an increasing number of SNPs. FIG. 13A: 100 SNPs, FIG. 13B: 333 SNPs, FIG. 13C: 667 SNPs, FIG. 13D: 1000 SNPs.

FIG. 14 is a table indicating the sensitivity and specificity for detecting six microdeletion syndromes.

FIG. 15 is a graphical representation of euploidy. The x-axis represents the linear position of the individual polymorphic loci along the chromosome, and the y-axis represents the number of A allele reads as a fraction of the total (A+B) allele reads. Maternal and fetal genotypes are indicated to the right of the plots. The plots are symbol-coded according to maternal genotype, such that solid circles indicate a maternal genotype of AA, solid squares indicate a maternal genotype of BB, and open triangles indicate a maternal genotype of AB. The left plot is a plot of when two chromosomes are present, and the fetal cfDNA fraction is 0%. This plot is from a non-pregnant woman, and thus represents the pattern when the genotype is entirely maternal. Allele clusters are thus centered around 1 (AA alleles), 0.5 (AB alleles), and 0 (BB alleles). The center plot is a plot of when two chromosomes are present, and the fetal fraction is 12%. The contribution of fetal alleles to the fraction of A allele reads shifts the position of some allele spots up or down along the y-axis. The right plot is a plot of when two chromosomes are present, and the fetal fraction is 26%. The pattern, including two solid circle and two solid square peripheral bands and a trio of central open triangles, is readily apparent.

FIGS. 16A and 16B are graphical representations of 22q11.2 deletion syndrome. FIG. 16A is for a maternal 22q11.2 deletion carrier (as indicated by the absence of the open triangles indicating AB SNPs). FIG. 16B is for a paternally inherited 22q11 deletion in a fetus (as indicated by the presence of solid circle and solid square peripheral bands). The x-axis represents the linear position of the SNPs, and the y-axis indicates the fraction of A allele reads out of the total reads. Each individual circle, triangle or square represents a single SNP locus.

FIG. 17 is a graphical representation of maternally inherited Cri-du-Chat deletion syndrome (as indicated by the presence of two central open triangle shape bands instead of three open triangle shape bands). The x-axis represents the linear position of the SNPs, and the y-axis indicates the fraction of A allele reads out of the total reads. Each individual circle, triangle or square represents a single SNP locus.

FIG. 18 is a graphical representation of paternally inherited Wolf-Hirschhorn deletion syndrome (as indicated by the presence of solid circle and solid square peripheral bands). The x-axis represents the linear position of the SNPs, and the y-axis indicates the fraction of A allele reads out of the total reads. Each individual circle, triangle or square represents a single SNP locus.

FIGS. 19A-19D are graphical representations of X chromosome spike-in experiments to represent an extra copy of a chromosome or chromosome segment. The plots show different amounts of DNA from a father mixed with DNA from the daughter: 16% father DNA (FIG. 19A), 10% father DNA (FIG. 19B), 1% father DNA (FIG. 19C), and 0.1% father DNA (FIG. 19D). The x-axis represents the linear position of the SNPs on the X chromosome, and the y-axis indicates the fraction of M allele reads out of the total reads (M+R). Each individual criss-cross, circle, triangle or square represents a single SNP locus with allele M or R.

FIGS. 20A and 20B are graphs of the false negative rate using haplotype data (FIG. 20A) and without haplotype data (FIG. 20B).

FIGS. 21A and 21B are graphs of the false positive rate for p=1% using haplotype data (FIG. 21A) and without haplotype data (FIG. 21B).

FIGS. 22A and 22B are graphs of the false positive rate for p=1.5% using haplotype data (FIG. 22A) and without haplotype data (FIG. 22B).

FIGS. 23A and 23B are graphs of the false positive rate for p=2% using haplotype data (FIG. 23A) and without haplotype data (FIG. 23B).

FIGS. 24A and 24B are graphs of the false positive rate for p=2.5% using haplotype data (FIG. 24A) and without haplotype data (FIG. 24B).

FIGS. 25A and 25B are graphs of the false positive rate for p=3% using haplotype data (FIG. 25A) and without haplotype data (FIG. 25B).

FIG. 26 is a table of false positive rates for the first simulation.

FIG. 27 is a table of false negative rates for the first simulation.

FIG. 28 contains a graph of reference counts (counts of one allele, such as the “A” allele) divided by total counts for that locus for a normal (noncancerous) cell line, a graph of reference counts divided by total counts for a cancer cell line with a deletion, and a graph of reference counts divided by total counts for a mixture of DNA from the normal cell line (95%) and the cancer cell line (5%).

FIG. 29 is a graph of reference counts divided by total counts for a plasma sample from a patient with stage IIa breast cancer with a tumor fraction estimated to be 4.33% (in which 4.33% of the DNA is from tumor cells). The diamond portion of the graph represents a region in which no CNV is present. The portion of the graph with solid circles and squares represents a region in which a CNV is present and there is a visible separation of the measured allele ratios from the expected allele ratio of 0.5. The solid square indicates one haplotype, and the solid circle indicates the other haplotype. Approximately 636 heterozygous SNPs were analyzed in the region of the CNV.

FIG. 30 is a graph of reference counts divided by total counts for a plasma sample from a patient with stage IIb breast cancer with a tumor fraction estimated to be 0.58%. The open diamonds of the graph represent a region in which no CNV is present. The portion of the graph with solid circles and squares represents a region in which a CNV is present but there is no clearly visible separation of the measured allele ratios from the expected allele ratio of 0.5. For this analysis, 86 heterozygous SNPs were analyzed in the region of the CNV.

FIGS. 31A and 31B are graphs showing the maximum likelihood estimation of the tumor fraction. The maximum likelihood estimate is indicated by the peak of the graph and is 4.33% for FIG. 31A and 0.58% for FIG. 31B.

FIG. 32A is a comparison of the graphs of the log of the odds ratio for various possible tumor fractions for the high tumor fraction sample (4.33%) and the low tumor fraction sample (0.58%). If the log odds ratio is less than 0, the euploid hypothesis is more likely. If the log odds ratio is greater than 0, the presence of a CNV is more likely.

FIG. 32B is a graph of small tumor results plotted in probability space. The graph depicts the probability of a deletion divided by the probability of no deletion for various possible tumor fractions for the low tumor fraction sample (0.58%).

FIG. 33 is a graph of the log of the odds ratio for various possible tumor fractions for the low tumor fraction sample (0.58%).

FIG. 33 is an enlarged version of the graph in FIG. 32A for the low tumor fraction sample.

FIG. 34 is a graph showing the limit of detection for single nucleotide variants in a tumor biopsy using three different methods described in Example 6.

FIG. 35 is a graph showing the limit of detection for single nucleotide variants in a plasma sample using three different methods described in Example 6.

FIGS. 36A and 36B are graphs of the analysis of genomic DNA (FIG. 36A) or DNA from a single cell (FIG. 36B) using a library of approximately 28,000 primers designed to detect CNVs. The presence of two central bands instead of one central band indicates the presence of a CNV. The x-axis represents the linear position of the SNPs, and the y-axis indicates the fraction of A allele reads out of the total reads.

FIGS. 37A and 37B are graphs of the analysis of genomic DNA (FIG. 37A) or DNA from a single cell (FIG. 37B) using a library of approximately 3,000 primers designed to detect CNVs. The presence of two central bands instead of one central band indicates the presence of a CNV. The x-axis represents the linear position of the SNPs, and the y-axis indicates the fraction of A allele reads out of the total reads.

FIG. 38 is a graph illustrating the uniformity in DOR for these ~3,000 loci.

FIG. 39 is a table comparing error call metrics for genomic DNA and DNA from a single cell.

FIG. 40 is a graph of error rates for transition mutations and transversion mutations.

FIGS. 41A-D are graphs of the sensitivity of CoNVERGe determined with PlasmArts. FIG. 41A: Correlation between CoNVERGe-calculated AAI and actual input fraction in PlasmArt samples with DNA from a 22q11.2 deletion and matched normal cell lines. FIG. 41B: Correlation between calculated AAI and actual tumour DNA input in PlasmArt samples with DNA from HCC2218 breast cancer cells with chromosome 2p and 2q CNVs and matched normal HCC2218BL cells, containing 0-9.09% tumour DNA fractions. FIG. 41C: Correlation between calculated AAI and actual tumour DNA input in PlasmArt samples with DNA from HCC1954 breast cancer cells with chromosome 1p and 1q CNVs and matched normal HCC1954BL cells, containing 0-5.66% tumour DNA fractions. FIG. 41D: Allele frequency plot for HCC1954 cells used in FIG. 41C. In FIGS. 41A-C, data points and error bars indicate the mean and standard deviation (SD), respectively, of 3-8 replicates.

FIGS. 42A-B provide a model system for validation. PlasmArt samples were made from cell lines with similar size profiles to plasma. FIG. 42A illustrates a son's plasma with a 22q11 deletion spiked into the father's plasma. Focal CNV: 3 MB. FIG. 42B illustrates chromosomes 1 and 2: cancer cell lines spiked into a normal cell line of the same individual. CNVs on chromosome arms 1p, 1q, 2p, 2q. FIGS. 42A and 42B are graphs showing fragment size distributions of an exemplary PlasmArt standard.

FIG. 43A, FIG. 43B, FIG. 43C, and FIG. 43D provide results from a dilution curve of PlasmArt synthetic ctDNA standards for validation of microdeletion and cancer panels. FIG. 43A is a graph showing the maximum likelihood of tumor. FIG. 43B is an estimate of DNA fraction results as an odds ratio plot. FIG. 43C is a plot for the detection of transversion events. FIG. 43D is a plot for the detection of transition events.

FIG. 44 is a plot showing CNVs for various chromosomal regions as indicated for various samples at different % ctDNAs. The plot depicts plasma from 21 breast cancer patients (stage I-IIIB) and demonstrates that CNVs could be detected in ctDNA with an AAI≥0.45% using as few as 62 heterozygous SNPs.

FIG. 45 is a plot showing CNVs for various chromosomal regions for various ovarian cancer samples with different % ctDNA levels. The plot indicates a 100% detection rate at a 9.45% cutoff.

FIG. 46A is a table showing the percent of breast or lung cancer patients with an SNV or a combined SNV and/or CNV in ctDNA. The analysis was on ctDNA (plasma) from Stage I-III cancer patients and indicates that the ability to detect CNVs in plasma dramatically improves the detection rate vs. testing SNVs alone. FIG. 46B plots the cumulative proportion of TCGA breast cancer patients covered vs. genes with breast SNVs. FIG. 46C plots cumulative COSMIC patient capture vs. cumulative patient coverage (TCGA) for breast deletions. FIG. 46D plots cumulative COSMIC patient capture vs. cumulative patient coverage (TCGA) for breast amplifications.

FIG. 47A is a graph of % samples at different breast cancer stages with tumor-specific SNVs and/or CNVs in plasma. FIG. 47B is a table of percent detection of breast CNVs and SNVs by stage.

FIG. 48A is a graph of % samples at different breast cancer substages with tumor-specific SNVs and/or CNVs in plasma. FIG. 48B is a table of percent detection of breast CNVs and SNVs by tumor substage.

FIG. 49A is a graph of % samples at different lung cancer stages with tumor-specific SNVs and/or CNVs in plasma. FIG. 49B is a table of lung plasma detection rate of lung SNVs and/or CNVs.

FIG. 50A is a graph of % samples at different lung cancer substages with tumor-specific SNVs and/or CNVs in plasma. FIG. 50B is a table of lung plasma detection rate of lung SNVs and/or CNVs by tumor substage.

FIG. 51A represents the histological finding/history for primary lung tumors analyzed for clonal and subclonal tumor heterogeneity. FIG. 51B is a table of the VAF identities of the biopsied lung tumors by whole genome sequencing and assaying by AmpliSeq.

FIG. 52 illustrates the use of ctDNA from plasma to identify both clonal and subclonal SNV mutations to overcome tumor heterogeneity.

FIG. 53A is a table comparing VAF calls by AmpliSeq. FIG. 53B is a table comparing VAF calls by mmPCR-NGS. A comparison of the two tables for detection of SNVs in primary tumor indicates that SNVs were missed by AmpliSeq and SNV mutations were identified in ctDNA from plasma with mmPCR-NGS.

FIG. 54A is a plot of % VAF in Primary Lung Tumor. FIG. 54B is a linear regression plot of AmpliSeq VAF vs. Natera VAF.

FIG. 55 is a graph of Pool 1/4 of an 84-plex SNV PCR primer reaction when primer concentration is limited.

FIG. 56 is a graph of Pool 2/4 of an 84-plex SNV PCR primer reaction when primer concentration is limited.

FIG. 57 is a graph of Pool 3/4 of an 84-plex SNV PCR primer reaction when primer concentration is limited.

FIG. 58 is a graph of Pool 4/4 of an 84-plex SNV PCR primer reaction when primer concentration is limited.

FIG. 59 illustrates a plot of Limit of Detection (LOD) vs. Depth of Read (DOR) for detection of SNV Transition and Transversion mutations in an 84-plex PCR reaction at 15 PCR cycles.

FIG. 60 illustrates a plot of Limit of Detection (LOD) vs. Depth of Read (DOR) for detection of SNV Transition and Transversion mutations in an 84-plex PCR reaction at 20 PCR cycles.

FIG. 61 illustrates a plot of Limit of Detection (LOD) vs. Depth of Read (DOR) for detection of SNV Transition and Transversion mutations in an 84-plex PCR reaction at 25 PCR cycles.

FIG. 62A is a plot illustrating sensitivity of detection of SNVs in tumor cell genomic DNA. FIG. 62B illustrates sensitivity of detection of SNVs in 1/3 single cells. FIG. 62C illustrates sensitivity of detection of SNVs in 2/3 single cells. FIG. 62D illustrates sensitivity of detection of SNVs in 3/3 single cells. Comparable sensitivities are seen between tumor and single cell genomic DNA.

FIG. 63A illustrates the workflow for analysis of CNVs in a variety of cancer sample types in a massively multiplexed PCR (mmPCR) assay targeting SNPs. FIG. 63B illustrates detection of somatic CNVs in human breast cancer cell lines and matched normal cell lines (FIG. 63C) on the CoNVERGe platform. FIG. 63D illustrates detection of somatic CNVs in human breast cancer cell lines and matched normal cell lines (FIG. 63E) on the CytoSNP-12 microarray platform. FIG. 63F is a plot of the maximum homolog ratios for CNVs identified by CoNVERGe or CytoSNP-12 showing a strong linear correlation of identified CNVs by either method.

FIGS. 64A-H provide a comparison of Fresh Frozen (FF) and FFPE (formalin-fixed paraffin embedded) breast cancer samples to matched buffy coat gDNA control samples. FIG. 64A is a FF breast tissue control sample analyzed by CoNVERGe. FIG. 64B is a FFPE breast tissue control sample analyzed by CoNVERGe. FIG. 64C is a FF breast tumour tissue sample analyzed by CoNVERGe. FIG. 64D is a FFPE breast tumour tissue sample analyzed by CoNVERGe. FIG. 64E is a FF breast tumour tissue sample analyzed by CytoSNP-12. FIG. 64F is a FFPE breast tumour tissue sample analyzed by CytoSNP-12. FIG. 64G compares the CoNVERGe assay to a microarray assay on breast cancer cell lines and FIG. 64H compares the CoNVERGe assay to the OneScan assay on breast cancer cell lines.

FIGS. 65A-D illustrate allele frequency plots to reflect chromosome copy number using the CoNVERGe assay to detect CNVs in single cells. FIG. 65A is the analysis of 1/3 breast cancer single cell replicates. FIG. 65B is the analysis of 2/3 breast cancer single cell replicates. FIG. 65C is the analysis of 3/3 breast cancer single cell replicates. FIG. 65D is the analysis of a B-lymphocyte cell line lacking CNVs in the target regions.

FIGS. 66A-C illustrate allele frequency plots to reflect chromosome copy number using the CoNVERGe assay to detect CNVs in real plasma samples. FIG. 66A is a stage II breast cancer plasma cfDNA sample and its matched tumor biopsy gDNA. FIG. 66B is a late stage ovarian cancer plasma cfDNA sample and its matched tumor biopsy gDNA. FIG. 66C is a chart illustrating tumor heterogeneity as determined by CNV detection in five late stage ovarian cancer plasma and matched tissue samples.

FIGS. 67A-H list the chromosome positions, SNVs and mutation change in breast cancer.

FIGS. 68A-B illustrate the major (FIG. 68A) and minor (FIG. 68B) allele frequencies of SNPs used in a 3168 mmPCR reaction.

FIG. 69 shows an example system architecture X00 useful for performing embodiments of the present invention. System architecture X00 includes an analysis platform X08 and a laboratory information systems (“LISs”) X04. X04 can be connected to Genetic Data Source X10. X08 may be connected to LIS X04 over a network X02. Analysis platform X08 may alternatively or additionally be connected directly to LIS X06. LIS X06 can be connected to Genetic Data Source X10. Analysis platform X08 includes one or more of an input processor X12, a hypothesis manager X14, a modeler X16, an error correction unit X18, a machine learning unit X20, and an output processor X22.

FIG. 70 illustrates an example computer system Y00 for performing embodiments of the present invention. Computer system Y00 includes one or more processors Y10, a bus Y20, a main memory Y30, a memory controller Y75, a communications and network interface Y80, a communication path Y85, and input/output/display devices Y90, and may also include a secondary memory Y40. Y40 may include a hard disk drive Y50 and a removable storage drive Y60. Y60 can write to a removable storage unit Y70.

While the above-identified drawings set forth presently disclosed embodiments, other embodiments are also contemplated, as noted in the discussion. This disclosure presents illustrative embodiments by way of representation and not limitation. Numerous other modifications and embodiments can be devised by those skilled in the art which fall within the scope and spirit of the principles of the presently disclosed embodiments.

DETAILED DESCRIPTION OF THE INVENTION

In one aspect, the present invention generally relates, at least in part, to improved methods of determining the presence or absence of copy number variations, such as deletions or duplications of chromosome segments or entire chromosomes. The methods are particularly useful for detecting small deletions or duplications, which can be difficult to detect with high specificity and sensitivity using prior methods due to the small amount of data available from the relevant chromosome segment. The methods include improved analytical methods, improved bioassay methods, and combinations of improved analytical and bioassay methods. Methods of the invention can also be used to detect deletions or duplications that are only present in a small percentage of the cells or nucleic acid molecules that are tested. This allows deletions or duplications to be detected prior to the occurrence of disease (such as at a precancerous stage) or in the early stages of disease, such as before a large number of diseased cells (such as cancer cells) with the deletion or duplication accumulate. The more accurate detection of deletions or duplications associated with a disease or disorder enables improved methods for diagnosing, prognosticating, preventing, delaying, stabilizing, or treating the disease or disorder. Several deletions or duplications are known to be associated with cancer or with severe mental or physical handicaps.

In another aspect, the present invention generally relates, at least in part, to improved methods of detecting single nucleotide variations (SNVs). These improved methods include improved analytical methods, improved bioassay methods, and improved methods that use a combination of improved analytical and bioassay methods. The methods in certain illustrative embodiments are used to detect, diagnose, monitor, or stage cancer, for example in samples where the SNV is present at very low concentrations, for example less than 10%, 5%, 4%, 3%, 2.5%, 2%, 1%, 0.5%, 0.25%, or 0.1% relative to the total number of normal copies of the SNV locus, such as circulating free DNA samples. That is, these methods in certain illustrative embodiments are particularly well suited for samples where there is a relatively low percentage of a mutation or variant relative to the normal polymorphic alleles present for that genetic locus. Finally, provided herein are methods that combine the improved methods for detecting copy number variations with the improved methods for detecting single nucleotide variations.

Successful treatment of a disease such as cancer often relies on early diagnosis, correct staging of the disease, selection of an effective therapeutic regimen, and close monitoring to prevent or detect relapse. For cancer diagnosis, histological evaluation of tumor material obtained from tissue biopsy is often considered the most reliable method. However, the invasive nature of biopsy-based sampling has rendered it impractical for mass screening and regular follow up. Therefore, the present methods have the advantage that they can be performed non-invasively if desired, at relatively low cost and with fast turnaround time. The targeted sequencing that may be used by the methods of the invention requires fewer reads than shotgun sequencing, such as a few million reads instead of 40 million reads, thereby decreasing cost. The multiplex PCR and next generation sequencing that may be used increase throughput and reduce costs.

In some embodiments, the methods are used to detect a deletion, duplication, or single nucleotide variant in an individual. A sample from the individual that contains cells or nucleic acids suspected of having a deletion, duplication, or single nucleotide variant may be analyzed. In some embodiments, the sample is from a tissue or organ suspected of having a deletion, duplication, or single nucleotide variant, such as cells or a mass suspected of being cancerous. The methods of the invention can be used to detect a deletion, duplication, or single nucleotide variant that is only present in one cell or a small number of cells in a mixture containing cells with the deletion, duplication, or single nucleotide variant and cells without the deletion, duplication, or single nucleotide variant. In some embodiments, cfDNA or cfRNA from a blood sample from the individual is analyzed. In some embodiments, cfDNA or cfRNA is secreted by cells, such as cancer cells. In some embodiments, cfDNA or cfRNA is released by cells undergoing necrosis or apoptosis, such as cancer cells. The methods of the invention can be used to detect a deletion, duplication, or single nucleotide variant that is only present in a small percentage of the cfDNA or cfRNA. In some embodiments, one or more cells from an embryo are tested.

In some embodiments, the methods are used for non-invasive or invasive prenatal testing of a fetus. These methods can be used to determine the presence or absence of deletions or duplications of a chromosome segment or an entire chromosome, such as deletions or duplications known to be associated with severe mental or physical handicaps, learning disabilities, or cancer. In some embodiments for non-invasive prenatal testing (NIPT), cells, cfDNA or cfRNA from a blood sample from the pregnant mother are tested. The methods allow the detection of a deletion or duplication in the cells, cfDNA, or cfRNA from the fetus despite the large amount of cells, cfDNA, or cfRNA from the mother that is also present. In some embodiments for invasive prenatal testing, DNA or RNA from a sample from the fetus is tested (such as a CVS or amniocentesis sample). Even if the sample is contaminated with DNA or RNA from the pregnant mother, the methods can be used to detect a deletion or duplication in the fetal DNA or RNA.

In addition to determining the presence or absence of copy number variation, one or more other factors can be analyzed if desired. These factors can be used to increase the accuracy of the diagnosis (such as determining the presence or absence of cancer or an increased risk for cancer, classifying the cancer, or staging the cancer) or prognosis. These factors can also be used to select a particular therapy or treatment regimen that is likely to be effective in the subject. Exemplary factors include the presence or absence of polymorphisms or mutations; altered (increased or decreased) levels of total or particular cfDNA, cfRNA, microRNA (miRNA); altered (increased or decreased) tumor fraction; altered (increased or decreased) methylation levels; altered (increased or decreased) DNA integrity; and altered (increased or decreased) or alternative mRNA splicing.

The following sections describe methods for detecting deletions or duplications using phased data (such as inferred or measured phased data) or unphased data; samples that can be tested; methods for sample preparation, amplification, and quantification; methods for phasing genetic data; polymorphisms, mutations, nucleic acid alterations, mRNA splicing alterations, and changes in nucleic acid levels that can be detected; databases with results from the methods, other risk factors and screening methods; cancers that can be diagnosed or treated; cancer treatments; cancer models for testing treatments; and methods for formulating and administering treatments.

Exemplary Methods for Determining Ploidy Using Phased Data

Some of the methods of the invention are based in part on the discovery that using phased data for detecting CNVs decreases the false negative and false positive rates compared to using unphased data (FIGS. 20A-27). This improvement is greatest for samples with CNVs present at low levels. Thus, phased data increases the accuracy of CNV detection compared to using unphased data (such as methods that calculate allele ratios at one or more loci or aggregate allele ratios to give an aggregated value (such as an average value) over a chromosome or chromosome segment without considering whether the allele ratios at different loci indicate that the same or different haplotypes appear to be present in an abnormal amount). Using phased data allows a more accurate determination to be made of whether differences between measured and expected allele ratios are due to noise or due to the presence of a CNV. For example, if the differences between measured and expected allele ratios at most or all of the loci in a region indicate that the same haplotype is overrepresented, then a CNV is more likely to be present. Using linkage between alleles in a haplotype allows one to determine whether the measured genetic data is consistent with the same haplotype being overrepresented (rather than random noise). In contrast, if the differences between measured and expected allele ratios are only due to noise (such as experimental error), then in some embodiments, about half the time the first haplotype appears to be overrepresented and about the other half of the time, the second haplotype appears to be overrepresented.
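The contrast between a consistently overrepresented haplotype and random noise can be illustrated with a simple sign test. The following Python sketch is an illustration only and is not the analysis method of the invention; the function names, the input layout (A reads, B reads, and the allele carried by the first haplotype at each heterozygous locus), and the use of a two-sided binomial test are assumptions made for this example.

```python
from math import comb


def hap1_excess_counts(loci):
    """Count loci at which the first haplotype has more reads than the second.

    loci: iterable of (a_reads, b_reads, hap1_allele), where hap1_allele is
    "A" or "B" and comes from phasing (hypothetical input layout).
    Returns (n_hap1_over, n_informative); ties are skipped because they
    carry no direction.
    """
    over = 0
    informative = 0
    for a_reads, b_reads, hap1_allele in loci:
        hap1 = a_reads if hap1_allele == "A" else b_reads
        hap2 = b_reads if hap1_allele == "A" else a_reads
        if hap1 == hap2:
            continue
        informative += 1
        if hap1 > hap2:
            over += 1
    return over, informative


def sign_test_p_value(over, n):
    """Two-sided binomial p-value under the noise-only null.

    Under noise alone, each haplotype is expected to appear
    overrepresented at about half of the informative loci.
    """
    if n == 0:
        return 1.0
    k = max(over, n - over)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Ten loci, each with a slight excess of the first haplotype. The A allele
# fraction alone alternates above and below 0.5 and looks like noise; once
# the reads are assigned to haplotypes the excess is consistent.
loci = [(52, 48, "A"), (48, 52, "B")] * 5
over, n = hap1_excess_counts(loci)
print(over, n, sign_test_p_value(over, n))
```

In this toy input, the unphased A allele fractions are split evenly above and below 0.5, whereas the phased counts point to the same haplotype at every locus, which is the pattern the paragraph above associates with a CNV rather than noise.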

Accuracy can be increased by taking into account the linkage between SNPs, and the likelihood of crossovers having occurred during the meiosis that gave rise to the gametes that formed the embryo that grew into the fetus. Using linkage when creating the expected distribution of allele measurements for one or more hypotheses allows the creation of expected allele measurement distributions that correspond to reality considerably better than when linkage is not used. For example, imagine that there are two SNPs, 1 and 2, located near one another, and the mother is A at SNP 1 and A at SNP 2 on one homolog, and B at SNP 1 and B at SNP 2 on homolog two. If the father is A for both SNPs on both homologs, and a B is measured for the fetus at SNP 1, this indicates that homolog two has been inherited by the fetus, and therefore that there is a much higher likelihood of a B being present in the fetus at SNP 2. A model that takes into account linkage can predict this, while a model that does not take linkage into account cannot. Alternately, if a mother is AB at SNP 1 and AB at nearby SNP 2, then two hypotheses corresponding to maternal trisomy at that location can be used: one involving a matching copy error (nondisjunction in meiosis II or mitosis in early fetal development), and one involving an unmatching copy error (nondisjunction in meiosis I). In the case of a matching copy error trisomy, if the fetus inherited an AA from the mother at SNP 1, then the fetus is much more likely to inherit either an AA or BB from the mother at SNP 2, but not AB. In the case of an unmatching copy error, the fetus inherits an AB from the mother at both SNPs. The allele distribution hypotheses made by a CNV calling method that takes into account linkage can make these predictions, and therefore correspond to the actual allele measurements to a considerably greater extent than a CNV calling method that does not take into account linkage.
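
The two trisomy hypotheses in the example above can be enumerated in a few lines. This is a minimal sketch that assumes the maternal phase is known and that no crossover occurs between the two SNPs; the function name and the string encoding of alleles are hypothetical.

```python
def maternal_contributions(hap1, hap2, copy_error):
    """List the possible two-copy maternal contributions at linked SNPs.

    hap1 and hap2 are the mother's phased alleles on homolog one and two,
    for example ("A", "A") and ("B", "B"). No crossover is assumed.
    """
    if copy_error == "matching":
        # The same homolog is transmitted twice, so every SNP is homozygous.
        return [tuple(a + a for a in hap1), tuple(b + b for b in hap2)]
    if copy_error == "unmatching":
        # Both homologs are transmitted, so the fetus receives one of each.
        return [tuple("".join(sorted(a + b)) for a, b in zip(hap1, hap2))]
    raise ValueError("copy_error must be 'matching' or 'unmatching'")
```

With homologs ("A", "A") and ("B", "B"), the matching case yields AA at both SNPs or BB at both SNPs, and the unmatching case yields AB at both SNPs, which is the pattern a linkage-aware model can exploit.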

In some embodiments, phased genetic data is used to determine if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment in the genome of an individual (such as in the genome of one or more cells or in cfDNA or cfRNA). Exemplary overrepresentations include the duplication of the first homologous chromosome segment or the deletion of the second homologous chromosome segment. In some embodiments, there is not an overrepresentation since the first and second homologous chromosome segments are present in equal proportions (such as one copy of each segment in a diploid sample). In some embodiments, calculated allele ratios in a nucleic acid sample are compared to expected allele ratios to determine if there is an overrepresentation as described further below. In this specification the phrase “a first homologous chromosome segment as compared to a second homologous chromosome segment” means a first homolog of a chromosome segment and a second homolog of the chromosome segment.

In some embodiments, the method includes obtaining phased genetic data for the first homologous chromosome segment comprising the identity of the allele present at that locus on the first homologous chromosome segment for each locus in a set of polymorphic loci on the first homologous chromosome segment, obtaining phased genetic data for the second homologous chromosome segment comprising the identity of the allele present at that locus on the second homologous chromosome segment for each locus in the set of polymorphic loci on the second homologous chromosome segment, and obtaining measured genetic allelic data comprising, for each of the alleles at each of the loci in the set of polymorphic loci, the amount of each allele present in a sample of DNA or RNA from one or more target cells and one or more non-target cells from the individual. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment; calculating, for each of the hypotheses, expected genetic data for the plurality of loci in the sample from the obtained phased genetic data for one or more possible ratios of DNA or RNA from the one or more target cells to the total DNA or RNA in the sample; calculating (such as calculating on a computer) for each possible ratio of DNA or RNA and for each hypothesis, the data fit between the obtained genetic data of the sample and the expected genetic data for the sample for that possible ratio of DNA or RNA and for that hypothesis; ranking one or more of the hypotheses according to the data fit; and selecting the hypothesis that is ranked the highest, thereby determining the degree of overrepresentation of the number of copies of the first homologous chromosome segment in the genome of one or more cells from the individual.
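
The enumerate, predict, fit, and rank steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the three hypotheses, the binomial read-count model, the assumption that non-target cells carry one copy of each homolog, and all names are hypothetical choices.

```python
from math import lgamma, log

# Hypothetical hypotheses: copies of (first homolog, second homolog) in target cells.
HYPOTHESES = {
    "disomy": (1, 1),
    "duplication_of_first": (2, 1),
    "deletion_of_second": (1, 0),
}

def hap1_fraction(n1, n2, target_fraction):
    """Expected share of first-homolog alleles in the mixed sample.

    Assumes non-target cells contribute one copy of each homolog.
    """
    f = target_fraction
    return ((1 - f) + f * n1) / (2 * (1 - f) + f * (n1 + n2))

def binom_logpmf(k, n, p):
    """Log binomial probability of k first-homolog reads out of n."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)

def rank_hypotheses(hap1_counts, depths, target_fractions):
    """Score every (hypothesis, target fraction) pair; best fit first.

    target_fractions must lie strictly between 0 and 1.
    """
    scored = []
    for name, (n1, n2) in HYPOTHESES.items():
        for f in target_fractions:
            p = hap1_fraction(n1, n2, f)
            fit = sum(binom_logpmf(k, n, p) for k, n in zip(hap1_counts, depths))
            scored.append((fit, name, f))
    scored.sort(reverse=True)  # highest log-likelihood ranks first
    return scored
```

Here phasing enters through hap1_counts, which holds, for each heterozygous locus, the count of whichever allele lies on the first homolog. The first entry of the returned list is the selected hypothesis together with the target fraction that fits best.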

In one aspect, the invention features a method for determining a number of copies of a chromosome or chromosome segment of interest in the genome of a fetus. In some embodiments, the method includes obtaining phased genetic data for at least one biological parent of the fetus, wherein the phased genetic data comprises the identity of the allele present for each locus in a set of polymorphic loci on a first homologous chromosome segment and a second homologous chromosome segment in the parent. In some embodiments, the method includes obtaining genetic data at the set of polymorphic loci on the chromosome or chromosome segment in a mixed sample of DNA or RNA comprising fetal DNA or RNA and maternal DNA or RNA from the mother of the fetus by measuring the quantity of each allele at each locus. In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the number of copies of the chromosome or chromosome segment of interest present in the genome of the fetus. In some embodiments, the method includes creating (such as creating on a computer) for each of the hypotheses, a probability distribution of the expected quantity of each allele at each of the plurality of loci in the mixed sample from (i) the obtained phased genetic data from the parent(s) and optionally (ii) the probability of one or more crossovers that may have occurred during the formation of a gamete that contributed a copy of the chromosome or chromosome segment of interest to the fetus; calculating (such as calculating on a computer) a fit, for each of the hypotheses, between (1) the obtained genetic data of the mixed sample and (2) the probability distribution of the expected quantity of each allele at each of the plurality of loci in the mixed sample for that hypothesis; ranking one or more of the hypotheses according to the data fit; and selecting the hypothesis that is ranked the highest, thereby determining the number of copies of the chromosome segment of interest in the genome of the fetus.

In some embodiments, the method involves obtaining phased genetic data using any of the methods described herein or any known method. In some embodiments, the method involves simultaneously or sequentially in any order (i) obtaining phased genetic data for the first homologous chromosome segment comprising the identity of the allele present at that locus on the first homologous chromosome segment for each locus in a set of polymorphic loci on the first homologous chromosome segment, (ii) obtaining phased genetic data for the second homologous chromosome segment comprising the identity of the allele present at that locus on the second homologous chromosome segment for each locus in the set of polymorphic loci on the second homologous chromosome segment, and (iii) obtaining measured genetic allelic data comprising the amount of each allele at each of the loci in the set of polymorphic loci in a sample of DNA from one or more cells from the individual.

In some embodiments, the method involves calculating allele ratios for one or more loci in the set of polymorphic loci that are heterozygous in at least one cell from which the sample was derived (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother). In some embodiments, the calculated allele ratio for a particular locus is the measured quantity of one of the alleles divided by the total measured quantity of all the alleles for the locus. In some embodiments, the calculated allele ratio for a particular locus is the measured quantity of one of the alleles (such as the allele on the first homologous chromosome segment) divided by the measured quantity of one or more other alleles (such as the allele on the second homologous chromosome segment) for the locus. The calculated allele ratios may be calculated using any of the methods described herein or any standard method (such as any mathematical transformation of the calculated allele ratios described herein).
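
The two allele-ratio definitions above differ only in the denominator. A minimal sketch for a biallelic locus, with a hypothetical function name and mode labels:

```python
def allele_ratio(hap1_quantity, hap2_quantity, mode="fraction"):
    """Calculated allele ratio at one biallelic locus.

    mode="fraction": first-homolog allele over the total of all alleles.
    mode="odds":     first-homolog allele over the second-homolog allele.
    """
    if mode == "fraction":
        total = hap1_quantity + hap2_quantity
        if total == 0:
            raise ValueError("no measured alleles at this locus")
        return hap1_quantity / total
    if mode == "odds":
        if hap2_quantity == 0:
            raise ValueError("second-homolog allele was not observed")
        return hap1_quantity / hap2_quantity
    raise ValueError("mode must be 'fraction' or 'odds'")
```

For 60 reads of the first-homolog allele and 40 of the second, the fraction form gives 0.6 and the odds form gives 1.5.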

In some embodiments, the method involves determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment by comparing one or more calculated allele ratios for a locus to an allele ratio that is expected for that locus if the first and second homologous chromosome segments are present in equal proportions. In some embodiments, the expected allele ratio assumes the possible alleles for a locus have an equal likelihood of being present. In some embodiments in which the calculated allele ratio for a particular locus is the measured quantity of one of the alleles divided by the total measured quantity of all the alleles for the locus, the corresponding expected allele ratio is 0.5 for a biallelic locus, or 1/3 for a triallelic locus. In some embodiments, the expected allele ratio is the same for all the loci, such as 0.5 for all loci. In some embodiments, the expected allele ratio assumes that the possible alleles for a locus can have a different likelihood of being present, such as the likelihood based on the frequency of each of the alleles in a particular population that the subject belongs in, such as a population based on the ancestry of the subject. Such allele frequencies are publicly available (see, e.g., HapMap Project; Perlegen Human Haplotype Project; web at ncbi.nlm.nih.gov/projects/SNP/; Sherry S T, Ward M H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan. 1; 29(1):308-11, each of which is incorporated by reference in its entirety). In some embodiments, the expected allele ratio is the allele ratio that is expected for the particular individual being tested for a particular hypothesis specifying the degree of overrepresentation of the first homologous chromosome segment.
For example, the expected allele ratio for a particular individual may be determined based on phased or unphased genetic data from the individual (such as from a sample from the individual that is unlikely to have a deletion or duplication such as a noncancerous sample) or data from one or more relatives of the individual. In some embodiments for prenatal testing, the expected allele ratio is the allele ratio that is expected for a mixed sample that includes DNA or RNA from the pregnant mother and the fetus (such as a maternal plasma or serum sample that includes cfDNA from the mother and cfDNA from the fetus) for a particular hypothesis specifying the degree of overrepresentation of the first homologous chromosome segment. For example, the expected allele ratio for the mixed sample may be determined based on genetic data from the mother and predicted genetic data for the fetus (such as predictions for alleles that the fetus may have inherited from the mother and/or father). In some embodiments, phased or unphased genetic data from a sample of DNA or RNA from only the mother (such as the buffy coat from a maternal blood sample) is used to determine the alleles from the maternal DNA or RNA in the mixed sample as well as alleles that the fetus may have inherited from the mother (and thus may be present in the fetal DNA or RNA in the mixed sample). In some embodiments, phased or unphased genetic data from a sample of DNA or RNA from only the father is used to determine the alleles that the fetus may have inherited from the father (and thus may be present in the fetal DNA or RNA in the mixed sample). The expected allele ratios may be calculated using any of the methods described herein or any standard method (such as any mathematical transformation of the expected allele ratios described herein) (U.S. Publication No 2012/0270212, filed Nov. 18, 2011, which is hereby incorporated by reference in its entirety).
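
One way to picture the expected allele ratio for a maternal and fetal mixture is as a copy-number-weighted average of the two genomes. The formula below is a commonly used simplification stated here as an assumption (it ignores amplification bias and measurement error), and the function and parameter names are hypothetical.

```python
def expected_mixed_ratio(maternal_a, maternal_copies, fetal_a, fetal_copies,
                         fetal_fraction):
    """Expected fraction of allele A in a maternal/fetal mixture at one locus.

    maternal_a, fetal_a:           copies of allele A in each genome
    maternal_copies, fetal_copies: total copies of the locus in each genome
    fetal_fraction:                share of fetal DNA or RNA in the sample
    """
    f = fetal_fraction
    a_signal = (1 - f) * maternal_a + f * fetal_a
    total_signal = (1 - f) * maternal_copies + f * fetal_copies
    return a_signal / total_signal
```

With a mother who is AA, a disomic fetus that is AB, and a fetal fraction of 0.1, the expected A fraction is 0.95. If the fetus instead carried a paternal extra copy making it ABB, the value would be about 0.927.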

In some embodiments, a calculated allele ratio is indicative of an overrepresentation of the number of copies of the first homologous chromosome segment if either (i) the allele ratio for the measured quantity of the allele present at that locus on the first homologous chromosome divided by the total measured quantity of all the alleles for the locus is greater than the expected allele ratio for that locus, or (ii) the allele ratio for the measured quantity of the allele present at that locus on the second homologous chromosome divided by the total measured quantity of all the alleles for the locus is less than the expected allele ratio for that locus. In some embodiments, a calculated allele ratio is only considered indicative of overrepresentation if it is significantly greater or lower than the expected ratio for that locus. In some embodiments, a calculated allele ratio is indicative of no overrepresentation of the number of copies of the first homologous chromosome segment if either (i) the allele ratio for the measured quantity of the allele present at that locus on the first homologous chromosome divided by the total measured quantity of all the alleles for the locus is less than or equal to the expected allele ratio for that locus, or (ii) the allele ratio for the measured quantity of the allele present at that locus on the second homologous chromosome divided by the total measured quantity of all the alleles for the locus is greater than or equal to the expected allele ratio for that locus. In some embodiments, calculated ratios equal to the corresponding expected ratio are ignored (since they are indicative of no overrepresentation).

In various embodiments, one or more of the following methods is used to compare one or more of the calculated allele ratios to the corresponding expected allele ratio(s). In some embodiments, one determines whether the calculated allele ratio is above or below the expected allele ratio for a particular locus irrespective of the magnitude of the difference. In some embodiments, one determines the magnitude of the difference between the calculated allele ratio and the expected allele ratio for a particular locus irrespective of whether the calculated allele ratio is above or below the expected allele ratio. In some embodiments, one determines whether the calculated allele ratio is above or below the expected allele ratio and the magnitude of the difference for a particular locus. In some embodiments, one determines whether the average or weighted average value of the calculated allele ratios is above or below the average or weighted average value of the expected allele ratios irrespective of the magnitude of the difference. In some embodiments, one determines the magnitude of the difference between the average or weighted average value of the calculated allele ratios and the average or weighted average value of the expected allele ratios irrespective of whether the average or weighted average of the calculated allele ratio is above or below the average or weighted average value of the expected allele ratio. In some embodiments, one determines whether the average or weighted average value of the calculated allele ratios is above or below the average or weighted average value of the expected allele ratios and the magnitude of the difference. In some embodiments, one determines an average or weighted average value of the magnitude of the difference between the calculated allele ratios and the expected allele ratios.
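
Several of the comparisons listed above can be computed together. The sketch below returns the direction and magnitude of the difference between weighted averages, plus the weighted average of per-locus magnitudes. The function name, the returned keys, and the idea of using read depth as the weight are illustrative assumptions.

```python
def compare_ratios(calculated, expected, weights=None):
    """Summarize calculated versus expected allele ratios across loci."""
    if weights is None:
        weights = [1.0] * len(calculated)  # plain average
    total_weight = sum(weights)
    mean_calc = sum(w * c for w, c in zip(weights, calculated)) / total_weight
    mean_exp = sum(w * e for w, e in zip(weights, expected)) / total_weight
    signed = mean_calc - mean_exp
    mean_abs = sum(
        w * abs(c - e) for w, c, e in zip(weights, calculated, expected)
    ) / total_weight
    return {
        "direction": (signed > 0) - (signed < 0),  # +1 above, -1 below, 0 equal
        "magnitude": abs(signed),                  # difference of the averages
        "mean_abs_diff": mean_abs,                 # average per-locus magnitude
    }
```

The difference of averages can be near zero while the average per-locus magnitude is large, which happens when loci deviate in opposite directions.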

In some embodiments, the magnitude of the difference between the calculated allele ratio and the expected allele ratio for one or more loci is used to determine whether the overrepresentation of the number of copies of the first homologous chromosome segment is due to a duplication of the first homologous chromosome segment or a deletion of the second homologous chromosome segment in the genome of one or more of the cells.

In some embodiments, an overrepresentation of the number of copies of the first homologous chromosome segment is determined to be present if one or more of the following conditions is met. In some embodiments, the number of calculated allele ratios that are indicative of an overrepresentation of the number of copies of the first homologous chromosome segment is above a threshold value. In some embodiments, the number of calculated allele ratios that are indicative of no overrepresentation of the number of copies of the first homologous chromosome segment is below a threshold value. In some embodiments, the magnitude of the difference between the calculated allele ratios that are indicative of an overrepresentation of the number of copies of the first homologous chromosome segment and the corresponding expected allele ratios is above a threshold value. In some embodiments, for all calculated allele ratios that are indicative of overrepresentation, the sum of the magnitude of the difference between a calculated allele ratio and the corresponding expected allele ratio is above a threshold value. In some embodiments, the magnitude of the difference between the calculated allele ratios that are indicative of no overrepresentation of the number of copies of the first homologous chromosome segment and the corresponding expected allele ratios is below a threshold value. In some embodiments, the average or weighted average value of the calculated allele ratios for the measured quantity of the allele present on the first homologous chromosome divided by the total measured quantity of all the alleles for the locus is greater than the average or weighted average value of the expected allele ratios by at least a threshold value.
In some embodiments, the average or weighted average value of the calculated allele ratios for the measured quantity of the allele present on the second homologous chromosome divided by the total measured quantity of all the alleles for the locus is less than the average or weighted average value of the expected allele ratios by at least a threshold value. In some embodiments, the data fit between the calculated allele ratios and allele ratios that are predicted for an overrepresentation of the number of copies of the first homologous chromosome segment is below a threshold value (indicative of a good data fit). In some embodiments, the data fit between the calculated allele ratios and allele ratios that are predicted for no overrepresentation of the number of copies of the first homologous chromosome segment is above a threshold value (indicative of a poor data fit).

In some embodiments, an overrepresentation of the number of copies of the first homologous chromosome segment is determined to be absent if one or more of the following conditions is met. In some embodiments, the number of calculated allele ratios that are indicative of an overrepresentation of the number of copies of the first homologous chromosome segment is below a threshold value. In some embodiments, the number of calculated allele ratios that are indicative of no overrepresentation of the number of copies of the first homologous chromosome segment is above a threshold value. In some embodiments, the magnitude of the difference between the calculated allele ratios that are indicative of an overrepresentation of the number of copies of the first homologous chromosome segment and the corresponding expected allele ratios is below a threshold value. In some embodiments, the magnitude of the difference between the calculated allele ratios that are indicative of no overrepresentation of the number of copies of the first homologous chromosome segment and the corresponding expected allele ratios is above a threshold value. In some embodiments, the average or weighted average value of the calculated allele ratios for the measured quantity of the allele present on the first homologous chromosome divided by the total measured quantity of all the alleles for the locus minus the average or weighted average value of the expected allele ratios is less than a threshold value. In some embodiments, the average or weighted average value of the expected allele ratios minus the average or weighted average value of the calculated allele ratios for the measured quantity of the allele present on the second homologous chromosome divided by the total measured quantity of all the alleles for the locus is less than a threshold value.
In some embodiments, the data fit between the calculated allele ratios and allele ratios that are predicted for an overrepresentation of the number of copies of the first homologous chromosome segment is above a threshold value. In some embodiments, the data fit between the calculated allele ratios and allele ratios that are predicted for no overrepresentation of the number of copies of the first homologous chromosome segment is below a threshold value. In some embodiments, the threshold is determined from empirical testing of samples known to have a CNV of interest and/or samples known to lack the CNV.

In some embodiments, determining if there is an overrepresentation of the number of copies of the first homologous chromosome segment includes enumerating a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment. One exemplary hypothesis is the absence of an overrepresentation since the first and second homologous chromosome segments are present in equal proportions (such as one copy of each segment in a diploid sample). Other exemplary hypotheses include the first homologous chromosome segment being duplicated one or more times (such as 1, 2, 3, 4, 5, or more extra copies of the first homologous chromosome compared to the number of copies of the second homologous chromosome segment). Another exemplary hypothesis includes the deletion of the second homologous chromosome segment. Yet another exemplary hypothesis is the deletion of both the first and the second homologous chromosome segments. In some embodiments, predicted allele ratios for the loci that are heterozygous in at least one cell (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother) are estimated for each hypothesis given the degree of overrepresentation specified by that hypothesis. In some embodiments, the likelihood that the hypothesis is correct is calculated by comparing the calculated allele ratios to the predicted allele ratios, and the hypothesis with the greatest likelihood is selected.

In some embodiments, an expected distribution of a test statistic is calculated using the predicted allele ratios for each hypothesis. In some embodiments, the likelihood that the hypothesis is correct is calculated by comparing a test statistic that is calculated using the calculated allele ratios to the expected distribution of the test statistic that is calculated using the predicted allele ratios, and the hypothesis with the greatest likelihood is selected.

In some embodiments, predicted allele ratios for the loci that are heterozygous in at least one cell (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother) are estimated given the phased genetic data for the first homologous chromosome segment, the phased genetic data for the second homologous chromosome segment, and the degree of overrepresentation specified by that hypothesis. In some embodiments, the likelihood that the hypothesis is correct is calculated by comparing the calculated allele ratios to the predicted allele ratios; and the hypothesis with the greatest likelihood is selected.

Use of Mixed Samples

It will be understood that for many embodiments, the sample is a mixed sample with DNA or RNA from one or more target cells and one or more non-target cells. In some embodiments, the target cells are cells that have a CNV, such as a deletion or duplication of interest, and the non-target cells are cells that do not have the copy number variation of interest (such as a mixture of cells with the deletion or duplication of interest and cells without any of the deletions or duplications being tested). In some embodiments, the target cells are cells that are associated with a disease or disorder or an increased risk for disease or disorder (such as cancer cells), and the non-target cells are cells that are not associated with a disease or disorder or an increased risk for disease or disorder (such as noncancerous cells). In some embodiments, the target cells all have the same CNV. In some embodiments, two or more target cells have different CNVs. In some embodiments, one or more of the target cells has a CNV, polymorphism, or mutation associated with the disease or disorder or an increased risk for disease or disorder that is not found in at least one other target cell. In some such embodiments, the fraction of the cells that are associated with the disease or disorder or an increased risk for disease or disorder out of the total cells from a sample is assumed to be greater than or equal to the fraction of the most frequent of these CNVs, polymorphisms, or mutations in the sample. For example, if 6% of the cells have a K-ras mutation, and 8% of the cells have a BRAF mutation, at least 8% of the cells are assumed to be cancerous.
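
The lower bound in the example above is simply the largest per-alteration cell fraction. A minimal sketch, with a hypothetical function name and input layout:

```python
def min_target_cell_fraction(alteration_fractions):
    """Lower bound on the fraction of target (e.g. cancerous) cells.

    alteration_fractions maps each CNV, polymorphism, or mutation to the
    fraction of cells in the sample that carry it.
    """
    if not alteration_fractions:
        return 0.0  # nothing observed, so no lower bound above zero
    return max(alteration_fractions.values())
```

Passing {"K-ras": 0.06, "BRAF": 0.08} returns 0.08, matching the 8% figure in the text.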

In some embodiments, the ratio of DNA (or RNA) from the one or more target cells to the total DNA (or RNA) in the sample is calculated. In some embodiments, a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment is enumerated. In some embodiments, predicted allele ratios for the loci that are heterozygous in at least one cell (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother) are estimated for each hypothesis given the calculated ratio of DNA or RNA and the degree of overrepresentation specified by that hypothesis. In some embodiments, the likelihood that the hypothesis is correct is calculated by comparing the calculated allele ratios to the predicted allele ratios, and the hypothesis with the greatest likelihood is selected.

In some embodiments, an expected distribution of a test statistic calculated using the predicted allele ratios and the calculated ratio of DNA or RNA is estimated for each hypothesis. In some embodiments, the likelihood that the hypothesis is correct is determined by comparing a test statistic calculated using the calculated allele ratios and the calculated ratio of DNA or RNA to the expected distribution of the test statistic calculated using the predicted allele ratios and the calculated ratio of DNA or RNA, and the hypothesis with the greatest likelihood is selected.

In some embodiments, the method includes enumerating a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment. In some embodiments, the method includes estimating, for each hypothesis, either (i) predicted allele ratios for the loci that are heterozygous in at least one cell (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother) given the degree of overrepresentation specified by that hypothesis or (ii) for one or more possible ratios of DNA or RNA, an expected distribution of a test statistic calculated using the predicted allele ratios and the possible ratio of DNA or RNA from the one or more target cells to the total DNA or RNA in the sample. In some embodiments, a data fit is calculated by comparing either (i) the calculated allele ratios to the predicted allele ratios, or (ii) a test statistic calculated using the calculated allele ratios and the possible ratio of DNA or RNA to the expected distribution of the test statistic calculated using the predicted allele ratios and the possible ratio of DNA or RNA. In some embodiments, one or more of the hypotheses are ranked according to the data fit, and the hypothesis that is ranked the highest is selected. In some embodiments, a technique or algorithm, such as a search algorithm, is used for one or more of the following steps: calculating the data fit, ranking the hypotheses, or selecting the hypothesis that is ranked the highest. In some embodiments, the data fit is a fit to a beta-binomial distribution or a fit to a binomial distribution. In some embodiments, the technique or algorithm is selected from the group consisting of maximum likelihood estimation, maximum a-posteriori estimation, Bayesian estimation, dynamic estimation (such as dynamic Bayesian estimation), and expectation-maximization estimation. In some embodiments, the method includes applying the technique or algorithm to the obtained genetic data and the expected genetic data.
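
A beta-binomial data fit of the kind mentioned above can be sketched with the standard library alone. The parameterization by a mean and a concentration is an assumption chosen for readability (a = mean × concentration, b = (1 − mean) × concentration), and the function names are hypothetical.

```python
from math import lgamma

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_logpmf(k, n, mean, concentration):
    """Log probability of k first-homolog reads out of n under a beta-binomial.

    mean is the predicted allele ratio for the hypothesis; a smaller
    concentration models more locus-to-locus overdispersion.
    """
    a = mean * concentration
    b = (1.0 - mean) * concentration
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)

def data_fit(hap1_counts, depths, predicted_ratio, concentration):
    """Total log-likelihood across loci; higher means a better fit."""
    return sum(
        betabinom_logpmf(k, n, predicted_ratio, concentration)
        for k, n in zip(hap1_counts, depths)
    )
```

Note that this sketch scores fit as a log-likelihood, where larger is better. A fit expressed as a distance or negative log-likelihood would reverse the direction of any threshold comparison.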

In some embodiments, the method includes creating a partition of possible ratios that range from a lower limit to an upper limit for the ratio of DNA or RNA from the one or more target cells to the total DNA or RNA in the sample. In some embodiments, a set of one or more hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment is enumerated. In some embodiments, the method includes estimating, for each of the possible ratios of DNA or RNA in the partition and for each hypothesis, either (i) predicted allele ratios for the loci that are heterozygous in at least one cell (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother) given the possible ratio of DNA or RNA and the degree of overrepresentation specified by that hypothesis or (ii) an expected distribution of a test statistic calculated using the predicted allele ratios and the possible ratio of DNA or RNA. In some embodiments, the method includes calculating, for each of the possible ratios of DNA or RNA in the partition and for each hypothesis, the likelihood that the hypothesis is correct by comparing either (i) the calculated allele ratios to the predicted allele ratios, or (ii) a test statistic calculated using the calculated allele ratios and the possible ratio of DNA or RNA to the expected distribution of the test statistic calculated using the predicted allele ratios and the possible ratio of DNA or RNA. In some embodiments, the combined probability for each hypothesis is determined by combining the probabilities of that hypothesis for each of the possible ratios in the partition; and the hypothesis with the greatest combined probability is selected. In some embodiments, the combined probability for each hypothesis is determined by weighting the probability of a hypothesis for a particular possible ratio based on the likelihood that the possible ratio is the correct ratio.
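
Combining per-ratio probabilities over a partition amounts to a weighted sum, which is best computed in log space to avoid underflow. The sketch below assumes an evenly spaced partition and strictly positive weights; the function names and the dictionary layout are hypothetical.

```python
from math import exp, log

def make_partition(lower, upper, n_points):
    """Evenly spaced possible ratios from lower to upper (n_points >= 2)."""
    step = (upper - lower) / (n_points - 1)
    return [lower + i * step for i in range(n_points)]

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + log(sum(exp(v - peak) for v in values))

def select_hypothesis(loglik_by_hypothesis, ratio_weights):
    """Pick the hypothesis with the greatest combined probability.

    loglik_by_hypothesis maps a hypothesis name to a list of log-likelihoods,
    one per ratio in the partition. ratio_weights holds the (positive) weight
    given to each ratio, such as its likelihood of being the correct ratio.
    """
    log_weights = [log(w) for w in ratio_weights]
    combined = {
        name: log_sum_exp([ll + lw for ll, lw in zip(logliks, log_weights)])
        for name, logliks in loglik_by_hypothesis.items()
    }
    best = max(combined, key=combined.get)
    return best, combined
```

Equal weights correspond to treating every ratio in the partition as equally plausible, while weights that concentrate on one value approach the case in which the ratio is calculated beforehand.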

In some embodiments, a technique selected from the group consisting of maximum likelihood estimation, maximum a-posteriori estimation, Bayesian estimation, dynamic estimation (such as dynamic Bayesian estimation), and expectation-maximization estimation is used to estimate the ratio of DNA or RNA from the one or more target cells to the total DNA or RNA in the sample. In some embodiments, the ratio of DNA or RNA from the one or more target cells to the total DNA or RNA in the sample is assumed to be the same for two or more (or all) of the CNVs of interest. In some embodiments, the ratio of DNA or RNA from the one or more target cells to the total DNA or RNA in the sample is calculated for each CNV of interest.

Exemplary Methods for Using Imperfectly Phased Data

It will be understood that for many embodiments, imperfectly phased data is used. For example, it may not be known with 100% certainty which allele is present for one or more of the loci on the first and/or second homologous chromosome segment. In some embodiments, the priors for possible haplotypes of the individual (such as haplotypes based on population based haplotype frequencies) are used in calculating the probability of each hypothesis. In some embodiments, the priors for possible haplotypes are adjusted by either using another method to phase the genetic data or by using phased data from other subjects (such as prior subjects) to refine population data used for informatics based phasing of the individual.

In some embodiments, the phased genetic data comprises probabilistic data for two or more possible sets of phased genetic data, wherein each possible set of phased data comprises a possible identity of the allele present at each locus in the set of polymorphic loci on the first homologous chromosome segment and a possible identity of the allele present at each locus in the set of polymorphic loci on the second homologous chromosome segment. In some embodiments, the probability for at least one of the hypotheses is determined for each of the possible sets of phased genetic data. In some embodiments, the combined probability for the hypothesis is determined by combining the probabilities of the hypothesis for each of the possible sets of phased genetic data; and the hypothesis with the greatest combined probability is selected.

Any of the methods disclosed herein or any known method may be used to generate imperfectly phased data (such as using population based haplotype frequencies to infer the most likely phase) for use in the claimed methods. In some embodiments, phased data is obtained by probabilistically combining haplotypes of smaller segments. For example, possible haplotypes can be determined based on possible combinations of one haplotype from a first region with another haplotype from another region from the same chromosome. The probability that particular haplotypes from different regions are part of the same, larger haplotype block on the same chromosome can be determined using, e.g., population based haplotype frequencies and/or known recombination rates between the different regions.

In some embodiments, a single hypothesis rejection test is used for the null hypothesis of disomy. In some embodiments, the probability of the disomy hypothesis is calculated, and the hypothesis of disomy is rejected if the probability is below a given threshold value (such as less than 1 in 1,000). If the null hypothesis is rejected, this could be due to errors in the imperfectly phased data or due to the presence of a CNV. In some embodiments, more accurate phased data is obtained (such as phased data from any of the molecular phasing methods disclosed herein to obtain actual phased data rather than bioinformatics-based inferred phased data). In some embodiments, the probability of the disomy hypothesis is recalculated using the more accurate phased data to determine if the disomy hypothesis should still be rejected. Rejection of this hypothesis indicates that a duplication or deletion of the chromosome segment is present. If desired, the false positive rate can be altered by adjusting the threshold value.
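The two-stage rejection test described above can be sketched as follows. The function names, the returned labels, and the callable used to recompute the probability with more accurate phasing are hypothetical; only the 1 in 1,000 example threshold comes from the text.

```python
def reject_disomy(p_disomy, threshold=1e-3):
    """Reject the null hypothesis of disomy when its probability is below threshold."""
    return p_disomy < threshold


def two_stage_call(p_disomy_inferred, recompute_with_better_phasing, threshold=1e-3):
    """Apply the rejection test, re-testing with more accurate phased data if needed.

    recompute_with_better_phasing is a hypothetical zero-argument callable that
    returns the disomy probability recalculated from more accurate phased data
    (for example, molecularly phased data).
    """
    if not reject_disomy(p_disomy_inferred, threshold):
        return "disomy retained"
    # The first rejection may reflect phasing errors rather than a CNV,
    # so the probability is recalculated before making a call.
    p_refined = recompute_with_better_phasing()
    if reject_disomy(p_refined, threshold):
        return "duplication or deletion indicated"
    return "disomy retained after re-phasing"
```

Raising the threshold makes rejection easier and so raises the false positive rate; lowering it has the opposite effect.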

Further Exemplary Embodiments for Determining Ploidy Using Phased Data

In illustrative embodiments, provided herein is a method for determining ploidy of a chromosomal segment in a sample of an individual. The method includes the following steps:

a. receiving allele frequency data comprising the amount of each allele present in the sample at each locus in a set of polymorphic loci on the chromosomal segment;
b. generating phased allelic information for the set of polymorphic loci by estimating the phase of the allele frequency data;
c. generating individual probabilities of allele frequencies for the polymorphic loci for different ploidy states using the allele frequency data;
d. generating joint probabilities for the set of polymorphic loci using the individual probabilities and the phased allelic information; and
e. selecting, based on the joint probabilities, a best fit model indicative of chromosomal ploidy, thereby determining ploidy of the chromosomal segment.

As disclosed herein, the allele frequency data (also referred to herein as measured genetic allelic data) can be generated by methods known in the art. For example, the data can be generated using qPCR or microarrays. In one illustrative embodiment, the data is generated using nucleic acid sequence data, especially high throughput nucleic acid sequence data.

In certain illustrative examples, the allele frequency data is corrected for errors before it is used to generate individual probabilities. In specific illustrative embodiments, the errors that are corrected include allele amplification efficiency bias. In other embodiments, the errors that are corrected include ambient contamination and genotype contamination. In some embodiments, errors that are corrected include allele amplification bias, ambient contamination and genotype contamination.

In certain embodiments, the individual probabilities are generated using a set of models of both different ploidy states and allelic imbalance fractions for the set of polymorphic loci. In these embodiments, and other embodiments, the joint probabilities are generated by considering the linkage between polymorphic loci on the chromosome segment.

Accordingly, in one illustrative embodiment that combines some of these embodiments, provided herein is a method for detecting chromosomal ploidy in a sample of an individual that includes the following steps:

a. receiving nucleic acid sequence data for alleles at a set of polymorphic loci on a chromosome segment in the individual;
b. detecting allele frequencies at the set of loci using the nucleic acid sequence data;
c. correcting for allele amplification efficiency bias in the detected allele frequencies to generate corrected allele frequencies for the set of polymorphic loci;
d. generating phased allelic information for the set of polymorphic loci by estimating the phase of the nucleic acid sequence data;
e. generating individual probabilities of allele frequencies for the polymorphic loci for different ploidy states by comparing the corrected allele frequencies to a set of models of different ploidy states and allelic imbalance fractions of the set of polymorphic loci;
f. generating joint probabilities for the set of polymorphic loci by combining the individual probabilities considering the linkage between polymorphic loci on the chromosome segment; and
g. selecting, based on the joint probabilities, the best fit model indicative of chromosomal aneuploidy.

As disclosed herein, the individual probabilities can be generated using a set of models or hypotheses of both different ploidy states and average allelic imbalance fractions for the set of polymorphic loci. For example, in a particularly illustrative example, individual probabilities are generated by modeling ploidy states of a first homolog of the chromosome segment and a second homolog of the chromosome segment. The ploidy states that are modeled include the following:

(1) all cells have no deletion or amplification of the first homolog or the second homolog of the chromosome segment;
(2) at least some cells have a deletion of the first homolog or an amplification of the second homolog of the chromosome segment; and
(3) at least some cells have a deletion of the second homolog or an amplification of the first homolog of the chromosome segment.

It will be understood that the above models can also be referred to as hypotheses that are used to constrain a model. Therefore, demonstrated above are 3 hypotheses that can be used.

The average allelic imbalance fractions modeled can include any range of average allelic imbalance that includes the actual average allelic imbalance of the chromosomal segment. For example, in certain illustrative embodiments, the range of average allelic imbalance that is modeled can be between 0, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.75, 1, 2, 2.5, 3, 4, and 5% on the low end, and 1, 2, 2.5, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 95, and 99% on the high end. The intervals for the modeling within the range can be any interval depending on the computing power used and the time allowed for the analysis. For example, 0.01, 0.05, 0.02, or 0.1 intervals can be modeled.

In certain illustrative embodiments, the sample has an average allelic imbalance for the chromosomal segment of between 0.4% and 5%. In certain embodiments, the average allelic imbalance is low. In these embodiments, average allelic imbalance is typically less than 10%. In certain illustrative embodiments, the allelic imbalance is between 0.25, 0.3, 0.4, 0.5, 0.6, 0.75, 1, 2, 2.5, 3, 4, and 5% on the low end, and 1, 2, 2.5, 3, 4, and 5% on the high end. In other exemplary embodiments, the average allelic imbalance is between 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0% on the low end and 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0, 3.0, 4.0, or 5.0% on the high end. For example, the average allelic imbalance of the sample in an illustrative example is between 0.45 and 2.5%. In another example, the average allelic imbalance is detected with a sensitivity of 0.45, 0.5, 0.6, 0.8, 0.9, or 1.0. Exemplary samples with low allelic imbalance in methods of the present invention include plasma samples from individuals with cancer having circulating tumor DNA or plasma samples from pregnant females having circulating fetal DNA.

It will be understood that for SNVs, the proportion of abnormal DNA is typically measured using mutant allele frequency (number of mutant alleles at a locus/total number of alleles at that locus). Since the difference between the amounts of two homologs in tumours is analogous, we measure the proportion of abnormal DNA for a CNV by the average allelic imbalance (AAI), defined as |(H1−H2)|/(H1+H2), where Hi is the average number of copies of homolog i in the sample and Hi/(H1+H2) is the fractional abundance, or homolog ratio, of homolog i. The maximum homolog ratio is the homolog ratio of the more abundant homolog.
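The AAI and homolog ratio definitions above translate directly into a short calculation. The following is a minimal sketch; the function names are chosen for illustration only.

```python
def average_allelic_imbalance(h1, h2):
    """AAI = |H1 - H2| / (H1 + H2), with H1 and H2 the average homolog copy numbers."""
    if h1 < 0 or h2 < 0 or h1 + h2 == 0:
        raise ValueError("copy numbers must be non-negative and not both zero")
    return abs(h1 - h2) / (h1 + h2)


def homolog_ratios(h1, h2):
    """Fractional abundance Hi / (H1 + H2) of each homolog."""
    total = h1 + h2
    if total <= 0:
        raise ValueError("total copy number must be positive")
    return h1 / total, h2 / total


def max_homolog_ratio(h1, h2):
    """Homolog ratio of the more abundant homolog."""
    return max(homolog_ratios(h1, h2))


# Example: homolog 1 averages 1.05 copies and homolog 2 averages 1.00 copies.
aai = average_allelic_imbalance(1.05, 1.00)
top = max_homolog_ratio(1.05, 1.00)
```

It follows from the two definitions that the maximum homolog ratio equals (1 + AAI)/2, so either quantity determines the other.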

Assay drop-out rate is the percentage of SNPs with no reads, estimated using all SNPs. Single allele drop-out (ADO) rate is the percentage of SNPs with only one allele present, estimated using only heterozygous SNPs. Genotype confidence can be determined by fitting a binomial distribution to the number of reads at each SNP that were B-allele reads, and using the ploidy status of the focal region of the SNP to estimate the probability of each genotype.
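The two drop-out rates defined above can be sketched as follows. The input layout (a list of per-SNP read counts with a heterozygosity flag) and the choice to use all heterozygous SNPs as the ADO denominator are assumptions made for this example; rates are returned as fractions rather than percentages.

```python
def dropout_rates(snps):
    """Return (assay drop-out rate, single allele drop-out rate) as fractions.

    snps is a hypothetical list of (a_reads, b_reads, is_heterozygous) tuples,
    one per SNP, where is_heterozygous reflects the known genotype.
    """
    if not snps:
        raise ValueError("at least one SNP is required")
    # Assay drop-out: SNPs with no reads at all, estimated using all SNPs.
    no_reads = sum(1 for a, b, _ in snps if a + b == 0)
    assay_dropout = no_reads / len(snps)

    # ADO: heterozygous SNPs where exactly one allele has reads.
    het = [(a, b) for a, b, is_het in snps if is_het]
    one_allele = sum(1 for a, b in het if (a == 0) != (b == 0))
    ado = one_allele / len(het) if het else float("nan")
    return assay_dropout, ado


example_snps = [
    (10, 12, True),
    (0, 0, False),   # assay drop-out
    (25, 0, True),   # allele drop-out
    (30, 0, False),  # homozygous, not counted toward ADO
    (8, 9, True),
    (0, 14, True),   # allele drop-out
]
assay_rate, ado_rate = dropout_rates(example_snps)
```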

For tumor tissue samples, chromosomal aneuploidy (exemplified in this paragraph by CNVs) can be delineated by transitions between allele frequency distributions. In plasma samples, CNVs can be identified by a maximum likelihood algorithm that searches for plasma CNVs in regions where the tumor sample from the same individual also has CNVs, using haplotype information deduced from the tumor sample. This algorithm can model expected allelic frequencies across all allelic imbalance ratios at 0.025% intervals for three sets of hypotheses: (1) all cells are normal (no allelic imbalance), (2) some/all cells have a homolog 1 deletion or homolog 2 amplification, or (3) some/all cells have a homolog 2 deletion or homolog 1 amplification. The likelihood of each hypothesis can be determined at each SNP using a Bayesian classifier based on a beta binomial model of expected and observed allele frequencies at all heterozygous SNPs, and then the joint likelihood across multiple SNPs can be calculated, in certain illustrative embodiments taking linkage of the SNP loci into consideration, as exemplified herein. The maximum likelihood hypothesis can then be selected.
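The grid search over hypotheses and allelic imbalance values described above can be illustrated with a simplified sketch. Several assumptions are made here that are not taken from the text: reads are already assigned to homolog 1 or homolog 2 using phased data, the expected homolog 1 read fraction is (1 − AAI)/2 or (1 + AAI)/2 depending on the hypothesis, the beta binomial is parameterized by a fixed `concentration` value, SNPs are treated as independent once phased, and a plain maximum likelihood selection stands in for the Bayesian classifier. All names are hypothetical.

```python
import math


def log_beta(a, b):
    """Natural log of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, alpha, beta):
    """Log probability of k successes in n trials under a beta binomial."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def expected_h1_fraction(hypothesis, aai):
    """Expected fraction of reads from homolog 1 at a heterozygous SNP."""
    if hypothesis == "normal":
        return 0.5
    if hypothesis == "h1_loss_or_h2_gain":
        return (1.0 - aai) / 2.0
    if hypothesis == "h2_loss_or_h1_gain":
        return (1.0 + aai) / 2.0
    raise ValueError("unknown hypothesis: " + hypothesis)


def joint_loglik(snps, hypothesis, aai, concentration=1000.0):
    """Sum of per-SNP log likelihoods; snps is a list of (h1_reads, total_reads)."""
    p = expected_h1_fraction(hypothesis, aai)
    alpha, beta = p * concentration, (1.0 - p) * concentration
    return sum(betabinom_logpmf(k, n, alpha, beta) for k, n in snps)


def aai_grid(upper, step=0.00025):
    """AAI values from step to upper; the default step is 0.025%."""
    return [i * step for i in range(1, int(round(upper / step)) + 1)]


def best_fit(snps, grid, concentration=1000.0):
    """Return the (hypothesis, aai) pair with the greatest joint likelihood."""
    candidates = [("normal", 0.0)]
    for hypothesis in ("h1_loss_or_h2_gain", "h2_loss_or_h1_gain"):
        candidates.extend((hypothesis, a) for a in grid if 0.0 < a < 1.0)
    return max(candidates, key=lambda c: joint_loglik(snps, c[0], c[1], concentration))


# Example: 50 heterozygous SNPs, each with 560 of 1000 reads from homolog 1.
call = best_fit([(560, 1000)] * 50, [0.04, 0.08, 0.12, 0.16])
```

A finer grid such as `aai_grid(0.05)` increases resolution at the cost of more likelihood evaluations, which mirrors the trade-off between interval size and computing time noted in the text.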

Consider a chromosomal region with an average of N copies in the tumor, and let c denote the fraction of DNA in plasma derived from the mixture of normal and tumour cells in a disomic region. AAI is calculated as:

$\mathrm{AAI} = \frac{c\,|N - 2|}{2 + c(N - 2)}$
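The formula above can be evaluated numerically as in the following sketch. It reads c as the fraction of plasma DNA contributed by tumour cells, which is an interpretation of the text, and the function name is hypothetical. Under that reading the total copy number in plasma is 2(1 − c) + cN = 2 + c(N − 2), and the difference between homologs is c|N − 2| when the copy number change affects only one homolog, which is consistent with the definition of AAI as |(H1−H2)|/(H1+H2).

```python
def aai_from_tumor_fraction(c, n_copies):
    """AAI = c * |N - 2| / (2 + c * (N - 2)).

    c is taken here to be the fraction of plasma DNA derived from tumour
    cells (an assumed reading); n_copies is the average copy number N of
    the region in the tumour.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie between 0 and 1")
    if n_copies < 0:
        raise ValueError("copy number must be non-negative")
    denominator = 2.0 + c * (n_copies - 2.0)
    if denominator == 0.0:
        raise ValueError("no DNA from this region is present (c = 1 and N = 0)")
    return c * abs(n_copies - 2.0) / denominator


# Example: a 2% tumour fraction with a single-copy gain and a single-copy loss.
gain = aai_from_tumor_fraction(0.02, 3)
loss = aai_from_tumor_fraction(0.02, 1)
```

At the same value of c, a single-copy loss yields a slightly larger AAI than a single-copy gain, because the loss reduces the total amount of DNA from the region while the gain increases it.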

In certain illustrative examples, the allele frequency data is corrected for errors before it is used to generate individual probabilities. Different types of error and/or bias correction are disclosed herein. In specific illustrative embodiments, the errors that are corrected are allele amplification efficiency bias. In other embodiments, the errors that are corrected include ambient contamination and genotype contamination. In some embodiments, errors that are corrected include allele amplification bias, ambient contamination and genotype contamination.

It will be understood that allele amplification efficiency bias can be determined for an allele as part of an experiment or laboratory determination that includes an on-test sample, or it can be determined at a different time using a set of samples that include the allele whose efficiency is being calculated. Ambient contamination and genotype contamination are typically determined on the same run as the on-test sample analysis.

In certain embodiments, ambient contamination and genotype contamination are determined for homozygous alleles in the sample. It will be understood that for any given sample from an individual, some loci in the sample will be heterozygous and others will be homozygous, even if a locus is selected for analysis because it has a relatively high heterozygosity in the population. In some embodiments it is advantageous that, although ploidy of a chromosomal segment may be determined using heterozygous loci for an individual, homozygous loci can be used to calculate ambient and genotype contamination.

In certain illustrative examples, the selecting is performed by analyzing a magnitude of a difference between the phased allelic information and estimated allelic frequencies generated for the models.

In illustrative examples, the individual probabilities of allele frequencies are generated based on a beta binomial model of expected and observed allele frequencies at the set of polymorphic loci. In illustrative examples, the individual probabilities are generated using a Bayesian classifier.

In certain illustrative embodiments, the nucleic acid sequence data is generated by performing high throughput DNA sequencing of a plurality of copies of a series of amplicons generated using a multiplex amplification reaction, wherein each amplicon of the series of amplicons spans at least one polymorphic locus of the set of polymorphic loci and wherein each of the polymorphic loci of the set is amplified. In certain embodiments, the multiplex amplification reaction is performed under limiting primer conditions for at least 1/2 of the reactions. In some embodiments, limiting primer concentrations are used in 1/10, 1/5, 1/4, 1/3, 1/2, or all of the reactions of the multiplex reaction. Provided herein are factors to consider to achieve limiting primer conditions in an amplification reaction such as PCR.

In certain embodiments, methods provided herein detect ploidy for multiple chromosomal segments across multiple chromosomes. Accordingly, the chromosomal ploidy in these embodiments is determined for a set of chromosome segments in the sample. For these embodiments, higher multiplex amplification reactions are needed. Accordingly, for these embodiments the multiplex amplification reaction can include, for example, between 2,500 and 50,000 multiplex reactions. In certain embodiments, the following ranges of multiplex reactions are performed: between 100, 200, 250, 500, 1,000, 2,500, 5,000, 10,000, 20,000, 25,000, and 50,000 on the low end of the range and between 200, 250, 500, 1,000, 2,500, 5,000, 10,000, 20,000, 25,000, 50,000, and 100,000 on the high end of the range.

In illustrative embodiments, the set of polymorphic loci is a set of loci that are known to exhibit high heterozygosity. However, it is expected that for any given individual, some of those loci will be homozygous. In certain illustrative embodiments, methods of the invention utilize nucleic acid sequence information for both homozygous and heterozygous loci for an individual. The homozygous loci of an individual are used, for example, for error correction, whereas heterozygous loci are used for the determination of allelic imbalance of the sample. In certain embodiments, at least 10% of the polymorphic loci are heterozygous loci for the individual.

As disclosed herein, preference is given for analyzing target SNP loci that are known to be heterozygous in the population. Accordingly, in certain embodiments, polymorphic loci are chosen wherein at least 10, 20, 25, 50, 75, 80, 90, 95, 99, or 100% of the polymorphic loci are known to be heterozygous in the population.

As disclosed herein, in certain embodiments the sample is a plasma sample from a pregnant female.

In some examples, the method further comprises performing the method on a control sample with a known average allelic imbalance ratio. The control can have an average allelic imbalance ratio for a particular allelic state indicative of aneuploidy of the chromosome segment, of between 0.4 and 10% to mimic an average allelic imbalance of an allele in a sample that is present in low concentrations, such as would be expected for a circulating free DNA from a fetus or from a tumor.

In some embodiments, PlasmArt controls, as disclosed herein, are used as the controls. Accordingly, in certain aspects the control is a sample generated by a method comprising fragmenting a nucleic acid sample known to exhibit a chromosomal aneuploidy into fragments that mimic the size of fragments of DNA circulating in plasma of the individual. In certain aspects a control is used that has no aneuploidy for the chromosome segment.

In illustrative embodiments, data from one or more controls can be analyzed in the method along with a test sample. The controls, for example, can include a different sample from the individual that is not suspected of containing chromosomal aneuploidy, or a sample that is suspected of containing CNV or a chromosomal aneuploidy. For example, where a test sample is a plasma sample suspected of containing circulating free tumor DNA, the method can also be performed for a control sample from a tumor from the subject along with the plasma sample. As disclosed herein, the control sample can be prepared by fragmenting a DNA sample known to exhibit a chromosomal aneuploidy. Such fragmenting can result in a DNA sample that mimics the DNA composition of an apoptotic cell, especially when the sample is from an individual afflicted with cancer. Data from the control sample will increase the confidence of the detection of chromosomal aneuploidy.

In certain embodiments of the methods of determining ploidy, the sample is a plasma sample from an individual suspected of having cancer. In these embodiments, the method further comprises determining based on the selecting whether copy number variation is present in cells of a tumor of the individual. For these embodiments, the sample can be a plasma sample from an individual. For these embodiments, the method can further include determining, based on the selecting, whether cancer is present in the individual.

These embodiments for determining ploidy of a chromosomal segment can further include detecting a single nucleotide variant at a single nucleotide variance location in a set of single nucleotide variance locations, wherein detecting either a chromosomal aneuploidy or the single nucleotide variant or both indicates the presence of circulating tumor nucleic acids in the sample.

These embodiments can further include receiving haplotype information of the chromosome segment for a tumor of the individual and using the haplotype information to generate the set of models of different ploidy states and allelic imbalance fractions of the set of polymorphic loci.

As disclosed herein, certain embodiments of the methods of determining ploidy can further include removing outliers from the initial or corrected allele frequency data before comparing the initial or the corrected allele frequencies to the set of models. For example, in certain embodiments, loci allele frequencies that are at least 2 or 3 standard deviations above or below the mean value for other loci on the chromosome segment are removed from the data before being used for the modeling.
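The outlier removal step described above can be sketched as a leave-one-out filter, in which each locus is compared against the mean and standard deviation of the other loci on the segment. The function name, the use of the sample standard deviation, and the handling of very short inputs are assumptions made for this example.

```python
import statistics


def remove_outlier_loci(freqs, n_sd=3.0):
    """Drop allele frequencies at least n_sd standard deviations from the others.

    freqs is a list of allele frequencies for loci on one chromosome segment.
    Each value is compared with the mean and sample standard deviation of the
    remaining values, so a single extreme locus does not mask itself.
    """
    kept = []
    for i, value in enumerate(freqs):
        others = freqs[:i] + freqs[i + 1:]
        if len(others) < 2:
            # Too few other loci to estimate a standard deviation; keep the value.
            kept.append(value)
            continue
        mean = statistics.mean(others)
        sd = statistics.stdev(others)
        is_outlier = value != mean and abs(value - mean) >= n_sd * sd
        if not is_outlier:
            kept.append(value)
    return kept


segment = [0.50, 0.51, 0.49, 0.50, 0.52, 0.48, 0.90]
filtered = remove_outlier_loci(segment, n_sd=3.0)
```

Passing `n_sd=2.0` applies the stricter of the two cut-offs mentioned in the text and removes more loci.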

As mentioned herein, it will be understood that for many of the embodiments provided herein, including those for determining ploidy of a chromosomal segment, imperfectly or perfectly phased data is preferably used. It will also be understood that provided herein are a number of features that provide improvements over prior methods for detecting ploidy, and that many different combinations of these features could be used.

In certain embodiments, as illustrated in FIGS. 69-70, provided herein are computer systems and computer readable media to perform any methods of the present invention. These include systems and computer readable media for performing methods of determining ploidy. Accordingly, and as non-limiting examples of system embodiments, to demonstrate that any of the methods provided herein can be performed using a system and a computer readable medium using the disclosure herein, in another aspect, provided herein is a system for detecting chromosomal ploidy in a sample of an individual, the system comprising:

a. an input processor configured to receive allelic frequency data comprising the amount of each allele present in the sample at each locus in a set of polymorphic loci on the chromosomal segment;
b. a modeler configured to:
    i. generate phased allelic information for the set of polymorphic loci by estimating the phase of the allele frequency data;
    ii. generate individual probabilities of allele frequencies for the polymorphic loci for different ploidy states using the allele frequency data; and
    iii. generate joint probabilities for the set of polymorphic loci using the individual probabilities and the phased allelic information; and
c. a hypothesis manager configured to select, based on the joint probabilities, a best fit model indicative of chromosomal ploidy, thereby determining ploidy of the chromosomal segment.

In certain embodiments of this system embodiment, the allele frequency data is data generated by a nucleic acid sequencing system. In certain embodiments, the system further comprises an error correction unit configured to correct for errors in the allele frequency data, wherein the corrected allele frequency data is used by the modeler to generate individual probabilities. In certain embodiments the error correction unit corrects for allele amplification efficiency bias. In certain embodiments, the modeler generates the individual probabilities using a set of models of both different ploidy states and allelic imbalance fractions for the set of polymorphic loci. The modeler, in certain exemplary embodiments, generates the joint probabilities by considering the linkage between polymorphic loci on the chromosome segment.

In one illustrative embodiment, provided herein is a system for detecting chromosomal ploidy in a sample of an individual that includes the following:

a. an input processor configured to receive nucleic acid sequence data for alleles at a set of polymorphic loci on a chromosome segment in the individual and detect allele frequencies at the set of loci using the nucleic acid sequence data;
b. an error correction unit configured to correct for errors in the detected allele frequencies and generate corrected allele frequencies for the set of polymorphic loci;
c. a modeler configured to:
    i. generate phased allelic information for the set of polymorphic loci by estimating the phase of the nucleic acid sequence data;
    ii. generate individual probabilities of allele frequencies for the polymorphic loci for different ploidy states by comparing the phased allelic information to a set of models of different ploidy states and allelic imbalance fractions of the set of polymorphic loci; and
    iii. generate joint probabilities for the set of polymorphic loci by combining the individual probabilities considering the relative distance between polymorphic loci on the chromosome segment; and
d. a hypothesis manager configured to select, based on the joint probabilities, a best fit model indicative of chromosomal aneuploidy.

In certain exemplary system embodiments provided herein the set of polymorphic loci comprises between 1000 and 50,000 polymorphic loci. In certain exemplary system embodiments provided herein the set of polymorphic loci comprises 100 known heterozygosity hot spot loci. In certain exemplary system embodiments provided herein the set of polymorphic loci comprises 100 loci that are at or within 0.5 kb of a recombination hot spot.

In certain exemplary system embodiments provided herein the best fit model analyzes the following ploidy states of a first homolog of the chromosome segment and a second homolog of the chromosome segment:

(1) all cells have no deletion or amplification of the first homolog or the second homolog of the chromosome segment;
(2) some or all cells have a deletion of the first homolog or an amplification of the second homolog of the chromosome segment; and
(3) some or all cells have a deletion of the second homolog or an amplification of the first homolog of the chromosome segment.

In certain exemplary system embodiments provided herein the errors that are corrected comprise allelic amplification efficiency bias, contamination, and/or sequencing errors. In certain exemplary system embodiments provided herein the contamination comprises ambient contamination and genotype contamination. In certain exemplary system embodiments provided herein the ambient contamination and genotype contamination are determined for homozygous alleles.

In certain exemplary system embodiments provided herein the hypothesis manager is configured to analyze a magnitude of a difference between the phased allelic information and estimated allelic frequencies generated for the models. In certain exemplary system embodiments provided herein the modeler generates individual probabilities of allele frequencies based on a beta binomial model of expected and observed allele frequencies at the set of polymorphic loci. In certain exemplary system embodiments provided herein the modeler generates individual probabilities using a Bayesian classifier.

In certain exemplary system embodiments provided herein the nucleic acid sequence data is generated by performing high throughput DNA sequencing of a plurality of copies of a series of amplicons generated using a multiplex amplification reaction, wherein each amplicon of the series of amplicons spans at least one polymorphic locus of the set of polymorphic loci and wherein each of the polymorphic loci of the set is amplified. In certain exemplary system embodiments provided herein, the multiplex amplification reaction is performed under limiting primer conditions for at least 1/2 of the reactions. In certain exemplary system embodiments provided herein, the sample has an average allelic imbalance of between 0.4% and 5%.

In certain exemplary system embodiments provided herein, the sample is a plasma sample from an individual suspected of having cancer, and the hypothesis manager is further configured to determine, based on the best fit model, whether copy number variation is present in cells of a tumor of the individual.

In certain exemplary system embodiments provided herein the sample is a plasma sample from an individual and the hypothesis manager is further configured to determine, based on the best fit model, that cancer is present in the individual. In these embodiments, the hypothesis manager can be further configured to detect a single nucleotide variant at a single nucleotide variance location in a set of single nucleotide variance locations, wherein detecting either a chromosomal aneuploidy or the single nucleotide variant or both indicates the presence of circulating tumor nucleic acids in the sample.

In certain exemplary system embodiments provided herein, the input processor is further configured to receive haplotype information of the chromosome segment for a tumor of the individual, and the modeler is configured to use the haplotype information to generate the set of models of different ploidy states and allelic imbalance fractions of the set of polymorphic loci.

In certain exemplary system embodiments provided herein, the modeler generates the models over allelic imbalance fractions ranging from 0% to 25%.

It will be understood that any of the methods provided herein can be executed by computer readable code that is stored on nontransitory computer readable medium. Accordingly, provided herein in one embodiment is a nontransitory computer readable medium for detecting chromosomal ploidy in a sample of an individual, comprising computer readable code that, when executed by a processing device, causes the processing device to:

a. receive allele frequency data comprising the amount of each allele present in the sample at each locus in a set of polymorphic loci on the chromosomal segment;
b. generate phased allelic information for the set of polymorphic loci by estimating the phase of the allele frequency data;
c. generate individual probabilities of allele frequencies for the polymorphic loci for different ploidy states using the allele frequency data;
d. generate joint probabilities for the set of polymorphic loci using the individual probabilities and the phased allelic information; and
e. select, based on the joint probabilities, a best fit model indicative of chromosomal ploidy, thereby determining ploidy of the chromosomal segment.

In certain computer readable medium embodiments, the allele frequency data is generated from nucleic acid sequence data. Certain computer readable medium embodiments further comprise correcting for errors in the allele frequency data and using the corrected allele frequency data for the generating individual probabilities step. In certain computer readable medium embodiments the errors that are corrected are allele amplification efficiency bias. In certain computer readable medium embodiments the individual probabilities are generated using a set of models of both different ploidy states and allelic imbalance fractions for the set of polymorphic loci. In certain computer readable medium embodiments the joint probabilities are generated by considering the linkage between polymorphic loci on the chromosome segment.

In one particular embodiment, provided herein is a nontransitory computer readable medium for detecting chromosomal ploidy in a sample of an individual, comprising computer readable code that, when executed by a processing device, causes the processing device to:

a. receive nucleic acid sequence data for alleles at a set of polymorphic loci on a chromosome segment in the individual;
b. detect allele frequencies at the set of loci using the nucleic acid sequence data;
c. correct for allele amplification efficiency bias in the detected allele frequencies to generate corrected allele frequencies for the set of polymorphic loci;
d. generate phased allelic information for the set of polymorphic loci by estimating the phase of the nucleic acid sequence data;
e. generate individual probabilities of allele frequencies for the polymorphic loci for different ploidy states by comparing the corrected allele frequencies to a set of models of different ploidy states and allelic imbalance fractions of the set of polymorphic loci;
f. generate joint probabilities for the set of polymorphic loci by combining the individual probabilities considering the linkage between polymorphic loci on the chromosome segment; and
g. select, based on the joint probabilities, the best fit model indicative of chromosomal aneuploidy.

In certain illustrative computer readable medium embodiments, the selecting is performed by analyzing a magnitude of a difference between the phased allelic information and estimated allelic frequencies generated for the models.

In certain illustrative computer readable medium embodiments the individual probabilities of allele frequencies are generated based on a beta binomial model of expected and observed allele frequencies at the set of polymorphic loci.
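As an illustration of the beta binomial model mentioned above, the following minimal Python sketch scores an observed allele count against the allele ratio expected under a given ploidy model. The function names and the `concentration` parameter (which sets how much overdispersion is allowed around the expected ratio) are assumptions made for illustration and are not taken from the disclosure; linkage between loci is not modeled here.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(K = k) for K ~ BetaBinomial(n, alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def allele_likelihood(n_a, n_total, expected_ratio, concentration=200.0):
    """Likelihood of n_a reads of allele A out of n_total reads.

    expected_ratio (strictly between 0 and 1) is the allele ratio predicted
    by a ploidy model. concentration is a hypothetical overdispersion
    parameter: larger values approach a plain binomial model.
    """
    alpha = expected_ratio * concentration
    beta = (1.0 - expected_ratio) * concentration
    return beta_binomial_pmf(n_a, n_total, alpha, beta)
```

For example, `allele_likelihood(50, 100, 0.5)` can be compared with `allele_likelihood(50, 100, 0.6)` to see which expected ratio better explains 50 A reads out of 100.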

It will be understood that any of the method embodiments provided herein can be performed by executing code stored on nontransitory computer readable medium.

Exemplary Embodiments for Detecting Cancer

In certain aspects, the present invention provides a method for detecting cancer. It will be understood that the sample can be a tumor sample or a liquid sample, such as plasma, from an individual suspected of having cancer. The methods are especially effective at detecting genetic mutations, such as single nucleotide alterations (for example SNVs) or copy number alterations (for example CNVs), in samples with low levels of these genetic alterations as a fraction of the total DNA in a sample. Thus the sensitivity for detecting DNA or RNA from a cancer in samples is exceptional. The methods can combine any or all of the improvements provided herein for detecting CNV and SNV to achieve this exceptional sensitivity.

Accordingly, in certain embodiments, provided herein is a method for determining whether circulating tumor nucleic acids are present in a sample in an individual, and a nontransitory computer readable medium comprising computer readable code that, when executed by a processing device, causes the processing device to carry out the method. The method includes the following steps:

-   a. analyzing the sample to determine a ploidy at a set of polymorphic loci on a chromosome segment in the individual; and
-   b. determining the level of average allelic imbalance present at the polymorphic loci based on the ploidy determination, wherein an average allelic imbalance equal to or greater than 0.4%, 0.45%, 0.5%, 0.6%, 0.7%, 0.75%, 0.8%, 0.9%, or 1% is indicative of the presence of circulating tumor nucleic acids, such as ctDNA, in the sample.

In certain illustrative examples, an average allelic imbalance greater than 0.4%, 0.45%, or 0.5% is indicative of the presence of ctDNA. In certain embodiments the method for determining whether circulating tumor nucleic acids are present further comprises detecting a single nucleotide variant at a single nucleotide variance site in a set of single nucleotide variance locations, wherein detecting either an allelic imbalance equal to or greater than 0.5% or detecting the single nucleotide variant, or both, is indicative of the presence of circulating tumor nucleic acids in the sample. It will be understood that any of the methods provided for detecting chromosomal ploidy or CNV can be used to determine the level of allelic imbalance, typically expressed as average allelic imbalance. It will be understood that any of the methods provided herein for detecting an SNV can be used to detect the single nucleotide variant for this aspect of the present invention.

In certain embodiments the method for determining whether circulating tumor nucleic acids are present further comprises performing the method on a control sample with a known average allelic imbalance ratio. The control, for example, can be a sample from the tumor of the individual. In some embodiments, the control has an average allelic imbalance expected for the sample under analysis, for example, an average allelic imbalance (AAI) between 0.5% and 5% or an average allelic imbalance ratio of 0.5%.

In certain embodiments the analyzing step in the method for determining whether circulating tumor nucleic acids are present includes analyzing a set of chromosome segments known to exhibit aneuploidy in cancer. In certain embodiments the analyzing step includes analyzing between 1,000 and 50,000, or between 100 and 1,000, polymorphic loci for ploidy. In certain embodiments the analyzing step includes analyzing between 100 and 1,000 single nucleotide variant sites. For example, in these embodiments the analyzing step can include performing a multiplex PCR to amplify amplicons across the 1,000 to 50,000 polymorphic loci and the 100 to 1,000 single nucleotide variant sites. This multiplex reaction can be set up as a single reaction or as pools of different subset multiplex reactions. The multiplex reaction methods provided herein, such as the massive multiplex PCR disclosed herein, provide an exemplary process for carrying out the amplification reaction to help attain improved multiplexing and, therefore, sensitivity levels.

In certain embodiments, the multiplex PCR reaction is carried out under limiting primer conditions for at least 10%, 20%, 25%, 50%, 75%, 90%, 95%, 98%, 99%, or 100% of the reactions. Improved conditions for performing the massive multiplex reaction provided herein can be used.

In certain aspects, the above method for determining whether circulating tumor nucleic acids are present in a sample in an individual, and all embodiments thereof, can be carried out with a system. The disclosure provides teachings regarding specific functional and structural features to carry out the methods. As a non-limiting example, the system includes the following:

-   a. An input processor configured to analyze data from the sample to determine a ploidy at a set of polymorphic loci on a chromosome segment in the individual; and
-   b. A modeler configured to determine the level of allelic imbalance present at the polymorphic loci based on the ploidy determination, wherein an allelic imbalance equal to or greater than 0.5% is indicative of the presence of circulating tumor nucleic acids in the sample.

Exemplary Embodiments for Detecting Single Nucleotide Variants

In certain aspects, provided herein are methods for detecting single nucleotide variants in a sample. The improved methods provided herein can achieve limits of detection of 0.015, 0.017, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4 or 0.5 percent SNV in a sample. All the embodiments for detecting SNVs can be carried out with a system. The disclosure provides teachings regarding specific functional and structural features to carry out the methods. Furthermore, provided herein are embodiments comprising a nontransitory computer readable medium comprising computer readable code that, when executed by a processing device, causes the processing device to carry out the methods for detecting SNVs provided herein.

Accordingly, provided herein in one embodiment is a method for determining whether a single nucleotide variant is present at a set of genomic positions in a sample from an individual, the method comprising:

-   a. for each genomic position, generating an estimate of efficiency and a per cycle error rate for an amplicon spanning that genomic position, using a training data set;
-   b. receiving observed nucleotide identity information for each genomic position in the sample;
-   c. determining a set of probabilities of single nucleotide variant percentage resulting from one or more real mutations at each genomic position, by comparing the observed nucleotide identity information at each genomic position to a model of different variant percentages using the estimated amplification efficiency and the per cycle error rate for each genomic position independently; and
-   d. determining the most-likely real variant percentage and confidence from the set of probabilities for each genomic position.

In illustrative embodiments of the method for determining whether a single nucleotide variant is present, the estimate of efficiency and the per cycle error rate are generated for a set of amplicons that span the genomic position. For example, 2, 3, 4, 5, 10, 15, 20, 25, 50, 100 or more amplicons can be included that span the genomic position.

In illustrative embodiments of the method for determining whether a single nucleotide variant is present, the observed nucleotide identity information comprises an observed number of total reads for each genomic position and an observed number of variant allele reads for each genomic position.

In illustrative embodiments of the method for determining whether a single nucleotide variant is present, the sample is a plasma sample and the single nucleotide variant is present in circulating tumor DNA of the sample.

In another embodiment, provided herein is a method for estimating the percent of single nucleotide variants that are present in a sample from an individual. The method includes the following steps:

-   a. at a set of genomic positions, generating an estimate of efficiency and a per cycle error rate for one or more amplicons spanning those genomic positions, using a training data set;
-   b. receiving observed nucleotide identity information for each genomic position in the sample;
-   c. generating an estimated mean and variance for the total number of molecules, background error molecules and real mutation molecules for a search space comprising an initial percentage of real mutation molecules using the amplification efficiency and the per cycle error rate of the amplicons; and
-   d. determining the percentage of single nucleotide variants present in the sample resulting from real mutations by determining a most-likely real single nucleotide variant percentage by fitting a distribution using the estimated means and variances to the observed nucleotide identity information in the sample.

In illustrative examples of this method for estimating the percent of single nucleotide variants that are present in a sample, the sample is a plasma sample and the single nucleotide variant is present in circulating tumor DNA of the sample.

The training data set for this embodiment of the invention typically includes samples from one or preferably a group of healthy individuals. In certain illustrative embodiments, the training data set is analyzed on the same day or even on the same run as one or more on-test samples. For example, samples from a group of 2, 3, 4, 5, 10, 15, 20, 25, 30, 36, 48, 96, 100, 192, 200, 250, 500, 1000 or more healthy individuals can be used to generate the training data set. Where data is available for a larger number of healthy individuals, e.g. 96 or more, confidence increases for amplification efficiency estimates even if runs are performed in advance of performing the method for on-test samples. The PCR error rate can use nucleic acid sequence information generated not only for the SNV base location, but for the entire amplified region around the SNV, since the error rate is per amplicon. For example, using samples from 50 individuals and sequencing a 20 base pair amplicon around the SNV, error frequency data from 1000 base reads can be used to determine the error frequency rate.

Typically the amplification efficiency is estimated by estimating a mean and standard deviation for amplification efficiency for an amplified segment and then fitting that to a distribution model, such as a binomial distribution or a beta binomial distribution. Error rates are determined for a PCR reaction with a known number of cycles and then a per cycle error rate is estimated.

In certain illustrative embodiments, estimating the starting molecules of the test data set further includes updating the estimate of the efficiency for the testing data set using the starting number of molecules estimated in step (b) if the observed number of reads is significantly different than the estimated number of reads. Then the estimate can be updated for a new efficiency and/or starting molecules.

The search space used for estimating the total number of molecules, background error molecules and real mutation molecules can range from 0.1%, 0.2%, 0.25%, 0.5%, 1%, 2.5%, 5%, 10%, 15%, 20%, or 25% on the low end to 1%, 2%, 2.5%, 5%, 10%, 12.5%, 15%, 20%, 25%, 50%, 75%, 90%, or 95% on the high end, expressed as the percentage of copies of a base at an SNV position that are the SNV base. Lower ranges, 0.1%, 0.2%, 0.25%, 0.5%, or 1% on the low end and 1%, 2%, 2.5%, 5%, 10%, 12.5%, or 15% on the high end, can be used in illustrative examples for plasma samples where the method is detecting circulating tumor DNA. Higher ranges are used for tumor samples.

A distribution is fit to the number of total error molecules (background error and real mutation) in the total molecules to calculate the likelihood or probability for each possible real mutation in the search space. This distribution could be a binomial distribution or a beta binomial distribution.

The most likely real mutation is determined by determining the most likely real mutation percentage and calculating the confidence using the data from fitting the distribution. As an illustrative example and not intended to limit the clinical interpretation of the methods provided herein, if the mean mutation rate is high then the percent confidence needed to make a positive determination of an SNV is lower. For example, if the mean mutation rate for an SNV in a sample using the most likely hypothesis is 5% and the percent confidence is 99%, then a positive SNV call would be made. On the other hand for this illustrative example, if the mean mutation rate for an SNV in a sample using the most likely hypothesis is 1% and the percent confidence is 50%, then in certain situations a positive SNV call would not be made. It will be understood that clinical interpretation of the data would be a function of sensitivity, specificity, prevalence rate, and alternative product availability.
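The search over candidate mutation percentages described above can be sketched in Python as follows. This is a simplified illustration, not the disclosed model: it uses a plain binomial distribution on read counts, treats the background error as a single fixed rate rather than deriving it from amplification efficiency and a per cycle error rate, and reports confidence as the normalized likelihood of the best candidate over the search space. All names are hypothetical.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Log of the binomial pmf, with the degenerate p = 0 and p = 1 cases handled."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1.0 - p))


def most_likely_variant_fraction(variant_reads, total_reads, background_error, search_space):
    """Return (best_fraction, confidence) over a grid of candidate fractions.

    Assumption: the expected variant read fraction is the real mutation
    fraction plus background error acting on the non-mutant molecules.
    """
    log_liks = []
    for frac in search_space:
        p = frac + (1.0 - frac) * background_error
        log_liks.append(log_binom_pmf(variant_reads, total_reads, p))
    peak = max(log_liks)
    weights = [exp(ll - peak) for ll in log_liks]  # rescaled to avoid underflow
    best = max(range(len(weights)), key=weights.__getitem__)
    return search_space[best], weights[best] / sum(weights)


# Usage: 50 variant reads out of 1000, with an assumed 0.1% background error.
grid = [0.0, 0.001, 0.005, 0.01, 0.05, 0.1]
fraction, confidence = most_likely_variant_fraction(50, 1000, 0.001, grid)
```

Including 0.0 in the grid lets "no real mutation" compete with the other candidates, so a low confidence for a nonzero fraction signals that background error alone explains the data nearly as well.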

In one illustrative embodiment, the sample is a circulating DNA sample, such as a circulating tumor DNA sample.

In another embodiment, provided herein is a method for detecting one or more single nucleotide variants in a test sample from an individual. The method according to this embodiment includes the following steps:

-   a. determining a median variant allele frequency for a plurality of control samples from each of a plurality of normal individuals, for each single nucleotide variant position in a set of single nucleotide variance positions based on results generated in a sequencing run, to identify selected single nucleotide variant positions having median variant allele frequencies in normal samples below a threshold value and to determine background error for each of the single nucleotide variant positions after removing outlier samples for each of the single nucleotide variant positions;
-   b. determining an observed depth of read weighted mean and variance for the selected single nucleotide variant positions for the test sample based on data generated in the sequencing run for the test sample; and
-   c. identifying, using a computer, one or more single nucleotide variant positions with a statistically significant depth of read weighted mean compared to the background error for that position, thereby detecting the one or more single nucleotide variants.

In certain embodiments of this method for detecting one or more SNVs the sample is a plasma sample, the control samples are plasma samples, and the one or more detected single nucleotide variants are present in circulating tumor DNA of the sample. In certain embodiments of this method for detecting one or more SNVs the plurality of control samples comprises at least 25 samples. In certain illustrative embodiments, the plurality of control samples is at least 5, 10, 15, 20, 25, 50, 75, 100, 200, or 250 samples on the low end and 10, 15, 20, 25, 50, 75, 100, 200, 250, 500, or 1000 samples on the high end.

In certain embodiments of this method for detecting one or more SNVs, outliers are removed from the data generated in the high throughput sequencing run before the observed depth of read weighted mean and observed variance are determined. In certain embodiments of this method for detecting one or more SNVs the depth of read for each single nucleotide variant position for the test sample is at least 100 reads.

In certain embodiments of this method for detecting one or more SNVs the sequencing run comprises a multiplex amplification reaction performed under limited primer reaction conditions. Improved methods for performing multiplex amplification reactions provided herein are used to perform these embodiments in illustrative examples.

Not to be limited by theory, methods of the present embodiment utilize a background error model using normal plasma samples that are sequenced on the same sequencing run as an on-test sample, to account for run-specific artifacts. Noisy positions with normal median variant allele frequencies above a threshold, for example >0.1%, 0.2%, 0.25%, 0.5%, 0.75%, or 1.0%, are removed.

Outlier samples are iteratively removed from the model to account for noise and contamination. For each base substitution at every genomic locus, the depth of read weighted mean and standard deviation of the error are calculated. In certain illustrative embodiments, single nucleotide variant positions in samples, such as tumor or cell-free plasma samples, that have at least a threshold number of reads, for example, at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100, 250, 500, or 1000 variant reads, and a Z-score greater than 2.5, 5, 7.5 or 10 against the background error model in certain embodiments, are counted as candidate mutations.
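A minimal Python sketch of the Z-score test against a background error model is given below. It assumes the background model for one position and base substitution is summarized by the variant allele frequencies and read depths of the normal samples, uses a depth-weighted population variance, and omits the iterative outlier removal. The function names and default thresholds are illustrative assumptions chosen from the example values listed above.

```python
from math import sqrt


def weighted_mean_and_sd(vafs, depths):
    """Depth of read weighted mean and standard deviation of error frequencies."""
    total = sum(depths)
    mean = sum(v * d for v, d in zip(vafs, depths)) / total
    variance = sum(d * (v - mean) ** 2 for v, d in zip(vafs, depths)) / total
    return mean, sqrt(variance)


def is_candidate_mutation(variant_reads, depth, bg_mean, bg_sd,
                          min_variant_reads=5, min_z=5.0):
    """Flag a position when it has enough variant reads and a large enough Z-score."""
    if variant_reads < min_variant_reads or bg_sd <= 0.0:
        return False
    z_score = (variant_reads / depth - bg_mean) / bg_sd
    return z_score > min_z


# Usage: background built from four hypothetical normal samples at one position.
normal_vafs = [0.001, 0.002, 0.001, 0.0015]
normal_depths = [10000, 10000, 10000, 10000]
bg_mean, bg_sd = weighted_mean_and_sd(normal_vafs, normal_depths)
call = is_candidate_mutation(100, 10000, bg_mean, bg_sd)
```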

In certain embodiments, a depth of read of greater than 100, 250, 500, 1,000, 2,000, 2,500, 5,000, 10,000, 20,000, 25,000, 50,000, or 100,000 on the low end of the range and 2,000, 2,500, 5,000, 7,500, 10,000, 25,000, 50,000, 100,000, 250,000 or 500,000 reads on the high end is attained in the sequencing run for each single nucleotide variant position in the set of single nucleotide variant positions. Typically, the sequencing run is a high throughput sequencing run. The mean or median values generated for the on-test samples, in illustrative embodiments, are weighted by depth of reads. Therefore, the likelihood that a variant allele determination is real in a sample with 1 variant allele detected in 1,000 reads is weighted higher than in a sample with 1 variant allele detected in 10,000 reads. Since determinations of a variant allele (i.e. mutation) are not made with 100% confidence, the identified single nucleotide variant can be considered a candidate variant or a candidate mutation.

Exemplary Test Statistic for Analysis of Phased Data

An exemplary test statistic is described below for analysis of phased data from a sample known or suspected of being a mixed sample containing DNA or RNA that originated from two or more cells that are not genetically identical. Let f denote the fraction of DNA or RNA of interest, for example the fraction of DNA or RNA with a CNV of interest, or the fraction of DNA or RNA from cells of interest, such as cancer cells. In some embodiments for prenatal testing, f denotes the fraction of fetal DNA, RNA, or cells in a mixture of fetal and maternal DNA, RNA, or cells. Note that this refers to the fraction of DNA from cells of interest assuming two copies of DNA are given by each cell of interest. This differs from the DNA fraction from cells of interest at a segment that is deleted or duplicated.

The possible allelic values of each SNP are denoted A and B. AA, AB, BA, and BB are used to denote all possible ordered allele pairs. In some embodiments, SNPs with ordered alleles AB or BA are analyzed. Let N_(i) denote the number of sequence reads of the ith SNP, and A_(i) and B_(i) denote the number of reads of the ith SNP that indicate allele A and B, respectively. It is assumed:

$N_{i} = A_{i} + B_{i}.$

The allele ratio R_(i) is defined:

$R_{i}\overset{\Delta}{=}{\frac{A_{i}}{N_{i}}.}$

Let T denote the number of SNPs targeted.

Without loss of generality, some embodiments focus on a single chromosome segment. As a matter of further clarity, in this specification the phrase “a first homologous chromosome segment as compared to a second homologous chromosome segment” means a first homolog of a chromosome segment and a second homolog of the chromosome segment. In some such embodiments, all of the target SNPs are contained in the chromosome segment of interest. In other embodiments, multiple chromosome segments are analyzed for possible copy number variations.

MAP Estimation

This method leverages the knowledge of phasing via ordered alleles to detect the deletion or duplication of the target segment. For each SNP i, define

$X_{i}\overset{\Delta}{=}\begin{cases}1 & R_{i} < 0.5 \text{ and SNP } i \text{ AB} \\ 0 & R_{i} \geq 0.5 \text{ and SNP } i \text{ AB} \\ 0 & R_{i} < 0.5 \text{ and SNP } i \text{ BA} \\ 1 & R_{i} \geq 0.5 \text{ and SNP } i \text{ BA}\end{cases}$

Then define

$S\overset{\Delta}{=}\sum\limits_{\text{All SNPs}} X_{i}.$

The distributions of the X_(i) and S under various copy number hypotheses (such as hypotheses for disomy, deletion of the first or second homolog, or duplication of the first or second homolog) are described below.
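The definitions of R_(i), X_(i), and S above translate directly into code. The following Python sketch assumes each SNP is supplied as a tuple of A reads, B reads, and its phase label ("AB" or "BA"); that input layout and the function names are assumptions made for illustration.

```python
def indicator(a_reads, b_reads, phase):
    """X_i for one phased heterozygous SNP, using R_i = A_i / N_i."""
    ratio = a_reads / (a_reads + b_reads)
    if phase == "AB":
        return 1 if ratio < 0.5 else 0
    if phase == "BA":
        return 1 if ratio >= 0.5 else 0
    raise ValueError("phase must be 'AB' or 'BA'")


def phased_statistic(snps):
    """S: the sum of X_i over all targeted SNPs."""
    return sum(indicator(a, b, phase) for a, b, phase in snps)


# Usage with four hypothetical SNPs given as (A reads, B reads, phase).
snps = [(40, 60, "AB"), (55, 45, "AB"), (70, 30, "BA"), (20, 80, "BA")]
s = phased_statistic(snps)
t = len(snps)
```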

Disomy Hypothesis

Under the hypothesis that the target segment is not deleted or duplicated,

$X_{i} = \begin{cases}0 & \text{wp } 1 - p\left(\frac{1}{2},N_{i}\right) \\ 1 & \text{wp } p\left(\frac{1}{2},N_{i}\right)\end{cases}$ where $p(b,n)\overset{\Delta}{=}\Pr\left\{X \geq \frac{n}{2}\right\},\quad X \sim \mathrm{Bino}(b,n).$

If we assume a constant depth of read N, this gives us a Binomial distribution for S with parameters $p\left(\frac{1}{2},N\right)$ and T.

Deletion Hypotheses

Under the hypothesis that the first homolog is deleted (i.e., an AB SNP becomes B, and a BA SNP becomes A), then R_(i) has a Binomial distribution with parameters $1 - \frac{1}{2 - f}$ and T for AB SNPs, and $\frac{1}{2 - f}$ and T for BA SNPs. Therefore,

$X_{i} = \begin{cases}0 & \text{wp } 1 - p\left(\frac{1}{2 - f},N_{i}\right) \\ 1 & \text{wp } p\left(\frac{1}{2 - f},N_{i}\right)\end{cases}$

If we assume a constant depth of read N, this gives a Binomial distribution for S with parameters $p\left(\frac{1}{2 - f},N\right)$ and T.

Under the hypothesis that the second homolog is deleted (i.e., an AB SNP becomes A, and a BA SNP becomes B), then R_(i) has a Binomial distribution with parameters $\frac{1}{2 - f}$ and T for AB SNPs, and $1 - \frac{1}{2 - f}$ and T for BA SNPs. Therefore,

$X_{i} = \begin{cases}0 & \text{wp } p\left(\frac{1}{2 - f},N_{i}\right) \\ 1 & \text{wp } 1 - p\left(\frac{1}{2 - f},N_{i}\right)\end{cases}$

If we assume a constant depth of read N, this gives a Binomial distribution for S with parameters $1 - p\left(\frac{1}{2 - f},N\right)$ and T.

Duplication Hypotheses

Under the hypothesis that the first homolog is duplicated (i.e., an AB SNP becomes AAB, and a BA SNP becomes BBA), then R_(i) has a Binomial distribution with parameters $\frac{1 + f}{2 + f}$ and T for AB SNPs, and $1 - \frac{1 + f}{2 + f}$ and T for BA SNPs. Therefore,

$X_{i} = \begin{cases}0 & \text{wp } p\left(\frac{1 + f}{2 + f},N_{i}\right) \\ 1 & \text{wp } 1 - p\left(\frac{1 + f}{2 + f},N_{i}\right)\end{cases}$

If we assume a constant depth of read N, this gives us a Binomial distribution for S with parameters $1 - p\left(\frac{1 + f}{2 + f},N\right)$ and T.

Under the hypothesis that the second homolog is duplicated (i.e., an AB SNP becomes ABB, and a BA SNP becomes BAA), then R_(i) has a Binomial distribution with parameters $1 - \frac{1 + f}{2 + f}$ and T for AB SNPs, and $\frac{1 + f}{2 + f}$ and T for BA SNPs. Therefore,

$X_{i} = \begin{cases}0 & \text{wp } 1 - p\left(\frac{1 + f}{2 + f},N_{i}\right) \\ 1 & \text{wp } p\left(\frac{1 + f}{2 + f},N_{i}\right)\end{cases}$

If we assume a constant depth of read N, this gives a Binomial distribution for S with parameters $p\left(\frac{1 + f}{2 + f},N\right)$ and T.

Classification

As demonstrated in the sections above, X_(i) is a binary random variable with

$\Pr\{X_{i} = 1\} = \begin{cases}p\left(\frac{1}{2},N_{i}\right) & \text{given disomy} \\ p\left(\frac{1}{2 - f},N_{i}\right) & \text{homolog 1 deletion} \\ 1 - p\left(\frac{1}{2 - f},N_{i}\right) & \text{homolog 2 deletion} \\ 1 - p\left(\frac{1 + f}{2 + f},N_{i}\right) & \text{homolog 1 duplication} \\ p\left(\frac{1 + f}{2 + f},N_{i}\right) & \text{homolog 2 duplication}\end{cases}$

This allows one to calculate the probability of the test statistic S under each hypothesis. The probability of each hypothesis given the measured data can be calculated. In some embodiments, the hypothesis with the greatest probability is selected. If desired, the distribution on S can be simplified by either approximating each N_(i) with a constant depth of read N or by truncating the depth of reads to a constant N. This simplification gives

$S \sim \begin{cases}\mathrm{Bino}\left(p\left(\frac{1}{2},N\right),T\right) & \text{given disomy} \\ \mathrm{Bino}\left(p\left(\frac{1}{2 - f},N\right),T\right) & \text{homolog 1 deletion} \\ \mathrm{Bino}\left(1 - p\left(\frac{1}{2 - f},N\right),T\right) & \text{homolog 2 deletion} \\ \mathrm{Bino}\left(1 - p\left(\frac{1 + f}{2 + f},N\right),T\right) & \text{homolog 1 duplication} \\ \mathrm{Bino}\left(p\left(\frac{1 + f}{2 + f},N\right),T\right) & \text{homolog 2 duplication}\end{cases}$
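Under the constant depth simplification, the classification step can be sketched in Python as follows. The sketch computes p(b, n) as an exact binomial tail sum, evaluates the likelihood of the observed S under each of the five hypotheses, and returns the hypothesis with the greatest likelihood. It assumes that f is already known and that all hypotheses are equally probable a priori; the function names are hypothetical.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial pmf computed in log space, with p = 0 and p = 1 handled."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1.0 - p))


def p_tail(b, n):
    """p(b, n) = Pr{X >= n/2} for X ~ Bino(b, n)."""
    start = (n + 1) // 2  # smallest integer k with k >= n/2
    return sum(binom_pmf(k, n, b) for k in range(start, n + 1))


def success_probabilities(f, n):
    """Pr{X_i = 1} under each hypothesis for constant depth of read n."""
    deletion = p_tail(1.0 / (2.0 - f), n)
    duplication = p_tail((1.0 + f) / (2.0 + f), n)
    return {
        "disomy": p_tail(0.5, n),
        "homolog 1 deletion": deletion,
        "homolog 2 deletion": 1.0 - deletion,
        "homolog 1 duplication": 1.0 - duplication,
        "homolog 2 duplication": duplication,
    }


def classify(s, t, f, n):
    """Return (most likely hypothesis, likelihood of S under every hypothesis)."""
    likelihoods = {name: binom_pmf(s, t, q)
                   for name, q in success_probabilities(f, n).items()}
    return max(likelihoods, key=likelihoods.get), likelihoods
```

A call such as `classify(108, 200, 0.2, 100)` evaluates S = 108 out of T = 200 SNPs at depth 100 for a target fraction of 20%.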

The value for f can be estimated by selecting the most likely value of f given the measured data, such as the value of f that generates the best data fit using an algorithm (e.g., a search algorithm) such as maximum likelihood estimation, maximum a-posteriori estimation, or Bayesian estimation. In some embodiments, multiple chromosome segments are analyzed and a value for f is estimated based on the data for each segment. If all the target cells have these duplications or deletions, the estimated values for f based on data for these different segments are similar. In some embodiments, f is experimentally measured, such as by determining the fraction of DNA or RNA from cancer cells based on methylation differences (hypomethylation or hypermethylation) between cancer and non-cancerous DNA or RNA.

In some embodiments for mixed samples of fetal and maternal nucleic acids, the value of f is the fetal fraction, that is the fraction of fetal DNA (or RNA) out of the total amount of DNA (or RNA) in the sample. In some embodiments, the fetal fraction is determined by obtaining genotypic data from a maternal blood sample (or fraction thereof) for a set of polymorphic loci on at least one chromosome that is expected to be disomic in both the mother and the fetus; creating a plurality of hypotheses each corresponding to different possible fetal fractions at the chromosome; building a model for the expected allele measurements in the blood sample at the set of polymorphic loci on the chromosome for possible fetal fractions; calculating a relative probability of each of the fetal fractions hypotheses using the model and the allele measurements from the blood sample or fraction thereof; and determining the fetal fraction in the blood sample by selecting the fetal fraction corresponding to the hypothesis with the greatest probability. In some embodiments, the fetal fraction is determined by identifying those polymorphic loci where the mother is homozygous for a first allele at the polymorphic locus, and the father is (i) heterozygous for the first allele and a second allele or (ii) homozygous for a second allele at the polymorphic locus; and using the amount of the second allele detected in the blood sample for each of the identified polymorphic loci to determine the fetal fraction in the blood sample (see, e.g., US Publ. No. 2012/0185176, filed Mar. 29, 2012, and US Pub. No. 2014/0065621, filed Mar. 13, 2013, which are each incorporated herein by reference in their entirety).

Another method for determining fetal fraction includes using a high throughput DNA sequencer to count alleles at a large number of polymorphic (such as SNP) genetic loci and modeling the likely fetal fraction (see, for example, US Publ. No. 2012/0264121, which is incorporated herein by reference in its entirety). Another method for calculating fetal fraction can be found in Sparks et al., “Noninvasive prenatal detection and selective analysis of cell-free DNA obtained from maternal blood: evaluation for trisomy 21 and trisomy 18,” Am J Obstet Gynecol 2012; 206:319.e1-9, which is incorporated herein by reference in its entirety. In some embodiments, fetal fraction is determined using a methylation assay (see, e.g., U.S. Pat. Nos. 7,754,428; 7,901,884; and 8,166,382, which are each incorporated herein by reference in their entirety) that assumes certain loci are methylated or preferentially methylated in the fetus, and those same loci are unmethylated or preferentially unmethylated in the mother.

FIGS. 1A-13D are graphs showing the distribution of the test statistic S divided by T (the number of SNPs) (“S/T”) for various copy number hypotheses for various depths of read and tumor fractions (where f is the fraction of tumor DNA out of total DNA) for an increasing number of SNPs.

Single Hypothesis Rejection

The distribution of S for the disomy hypothesis does not depend on f. Thus, the probability of the measured data can be calculated for the disomy hypothesis without calculating f. A single hypothesis rejection test can be used for the null hypothesis of disomy. In some embodiments, the probability of S under the disomy hypothesis is calculated, and the hypothesis of disomy is rejected if the probability is below a given threshold value (such as less than 1 in 1,000). This indicates that a duplication or deletion of the chromosome segment is present. If desired, the false positive rate can be altered by adjusting the threshold value.
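A single hypothesis rejection test of this kind might be sketched in Python as shown below. The text does not specify how "the probability of S" is computed; this sketch assumes one common reading, a two-sided exact binomial tail probability (the total probability of all outcomes no more likely than the observed S), together with a constant depth of read. The function names and the default threshold of 1 in 1,000 are illustrative.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial pmf in log space; assumes 0 < p < 1."""
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1.0 - p))


def p_tail(b, n):
    """p(b, n) = Pr{X >= n/2} for X ~ Bino(b, n)."""
    return sum(binom_pmf(k, n, b) for k in range((n + 1) // 2, n + 1))


def disomy_p_value(s, t, n):
    """Two-sided exact tail probability of S under disomy (assumed reading)."""
    q = p_tail(0.5, n)  # Pr{X_i = 1} under disomy; does not depend on f
    observed = binom_pmf(s, t, q)
    return sum(pk for pk in (binom_pmf(k, t, q) for k in range(t + 1))
               if pk <= observed * (1.0 + 1e-9))


def reject_disomy(s, t, n, threshold=1e-3):
    """True when the disomy hypothesis is rejected at the given threshold."""
    return disomy_p_value(s, t, n) < threshold
```

Raising or lowering `threshold` trades false positives against sensitivity, matching the adjustment described above.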

Exemplary Methods for Analysis of Phased Data

Exemplary methods are described below for analysis of data from a sample known or suspected of being a mixed sample containing DNA or RNA that originated from two or more cells that are not genetically identical. In some embodiments, phased data is used. In some embodiments, the method involves determining, for each calculated allele ratio, whether the calculated allele ratio is above or below the expected allele ratio and the magnitude of the difference for a particular locus. In some embodiments, a likelihood distribution is determined for the allele ratio at a locus for a particular hypothesis and the closer the calculated allele ratio is to the center of the likelihood distribution, the more likely the hypothesis is correct. In some embodiments, the method involves determining the likelihood that a hypothesis is correct for each locus. In some embodiments, the method involves determining the likelihood that a hypothesis is correct for each locus, and combining the probabilities of that hypothesis for each locus, and the hypothesis with the greatest combined probability is selected. In some embodiments, the method involves determining the likelihood that a hypothesis is correct for each locus and for each possible ratio of DNA or RNA from the one or more target cells to the total DNA or RNA in the sample. In some embodiments, a combined probability for each hypothesis is determined by combining the probabilities of that hypothesis for each locus and each possible ratio, and the hypothesis with the greatest combined probability is selected.

In one embodiment, the following hypotheses are considered: H₁₁ (all cells are normal), H₁₀ (presence of cells with only homolog 1, hence homolog 2 deletion), H₀₁ (presence of cells with only homolog 2, hence homolog 1 deletion), H₂₁ (presence of cells with homolog 1 duplication), H₁₂ (presence of cells with homolog 2 duplication). For a fraction f of target cells such as cancer cells or mosaic cells (or the fraction of DNA or RNA from the target cells), the expected allele ratio for heterozygous (AB or BA) SNPs can be found as follows:

$\begin{aligned} r(AB,H_{11}) &= r(BA,H_{11}) = 0.5, \\ r(AB,H_{10}) &= r(BA,H_{01}) = \frac{1}{2 - f}, \\ r(AB,H_{01}) &= r(BA,H_{10}) = \frac{1 - f}{2 - f}, \\ r(AB,H_{21}) &= r(BA,H_{12}) = \frac{1 + f}{2 + f}, \\ r(AB,H_{12}) &= r(BA,H_{21}) = \frac{1}{2 + f}. \end{aligned} \qquad \text{Equation (1)}$
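Equation (1) can be written as a small lookup in Python. The sketch below encodes the five AB ratios and obtains the BA ratios by swapping the roles of homolog 1 and homolog 2, as the equation does. The hypothesis labels are written as plain strings such as "H10", and the function name is an assumption made for illustration.

```python
def expected_allele_ratio(phase, hypothesis, f):
    """Expected A-allele ratio r(phase, hypothesis) from Equation (1)."""
    ab_ratio = {
        "H11": 0.5,
        "H10": 1.0 / (2.0 - f),
        "H01": (1.0 - f) / (2.0 - f),
        "H21": (1.0 + f) / (2.0 + f),
        "H12": 1.0 / (2.0 + f),
    }
    # For a BA SNP the homologs carry the opposite alleles, so each
    # hypothesis maps to the AB ratio of its mirrored hypothesis.
    mirror = {"H11": "H11", "H10": "H01", "H01": "H10", "H21": "H12", "H12": "H21"}
    if phase == "AB":
        return ab_ratio[hypothesis]
    if phase == "BA":
        return ab_ratio[mirror[hypothesis]]
    raise ValueError("phase must be 'AB' or 'BA'")


# Usage: expected ratios for a 20% target fraction under homolog 2 deletion.
r_ab = expected_allele_ratio("AB", "H10", 0.2)
r_ba = expected_allele_ratio("BA", "H10", 0.2)
```

A quick consistency check is that the AB and BA ratios for the same hypothesis sum to 1, since the two phases differ only in which allele sits on which homolog.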

Bias, Contamination, and Sequencing Error Correction:

The observation D_(s) at the SNP consists of the number of original mapped reads with each allele present, n_(A) ⁰ and n_(B) ⁰. Then, we can find the corrected reads n_(A) and n_(B) using the expected bias in the amplification of A and B alleles.

Let c_(a) denote the ambient contamination (such as contamination from DNA in the air or environment) and r(c_(a)) denote the allele ratio for the ambient contaminant (which is taken to be 0.5 initially). Moreover, c_(g) denotes the genotyped contamination rate (such as the contamination from another sample), and r(c_(g)) is the allele ratio for the contaminant. Let s_(e)(A,B) and s_(e)(B,A) denote the sequencing errors for calling one allele a different allele (such as by erroneously detecting an A allele when a B allele is present).

One can find the observed allele ratio q(r, c_(a), r(c_(a)), c_(g), r(c_(g)), s_(e)(A,B), s_(e)(B,A)) for a given expected allele ratio r by correcting for ambient contamination, genotyped contamination, and sequencing error.

Since the contaminant genotypes are unknown, population frequencies can be used to find P(r(c_(g))). More specifically, let p be the population frequency for one of the alleles (which may be referred to as a reference allele). Then, we have P(r(c_(g))=0)=(1−p)², P(r(c_(g))=0.5)=2p(1−p), and P(r(c_(g))=1)=p². The conditional expectation over r(c_(g)) can be used to determine E[q(r, c_(a), r(c_(a)), c_(g), r(c_(g)), s_(e)(A,B), s_(e)(B,A))]. Note that the ambient and genotyped contamination are determined using the homozygous SNPs, hence they are not affected by the absence or presence of deletions or duplications. Moreover, it is possible to measure the ambient and genotyped contamination using a reference chromosome if desired.
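The expectation over the unknown contaminant genotype can be sketched as follows, assuming the contaminant allele ratio takes the values 0, 0.5, and 1 for the three genotypes with the population-frequency probabilities described above. The correction function q itself is not specified here, so it is passed in by the caller; both function names are hypothetical.

```python
from typing import Callable, Dict


def contaminant_ratio_prior(p: float) -> Dict[float, float]:
    """P(r(c_g)) for the three contaminant genotypes.

    p is the population frequency of the reference allele; the keys are the
    contaminant allele ratios 0, 0.5 and 1 (assumed values for this sketch).
    """
    return {0.0: (1 - p) ** 2, 0.5: 2 * p * (1 - p), 1.0: p ** 2}


def expected_observed_ratio(q: Callable[[float], float], p: float) -> float:
    """E[q] over r(c_g), where q maps a contaminant ratio to an observed ratio.

    q is supplied by the caller; its form is an assumption of the caller,
    not something defined by this sketch.
    """
    return sum(prob * q(r_cg) for r_cg, prob in contaminant_ratio_prior(p).items())
```

As a usage example under a purely hypothetical linear-mixing model, `expected_observed_ratio(lambda r_cg: 0.9 * 0.6 + 0.1 * r_cg, p=0.3)` averages a 10% contaminant into an expected ratio of 0.6.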

Likelihood at each SNP:

The equation below gives the probability of observing n_(A) and n_(B) given an allele ratio r:

$P(n_{A}, n_{B} \mid r) = p_{bino}(n_{A};\, n_{A} + n_{B},\, r) = \binom{n_{A} + n_{B}}{n_{A}}\, r^{n_{A}} (1 - r)^{n_{B}}. \qquad \text{Equation (2)}$

Let D_(s) denote the data for SNP s. For each hypothesis hϵ{H₁₁, H₀₁, H₁₀, H₂₁, H₁₂}, one can let r=r(AB,h) or r=r(BA,h) in equation (1) and find the conditional expectation over r(c_(g)) to determine the observed allele ratio E[q(r, c_(a), r(c_(a)), c_(g), r(c_(g)), s_(e)(A,B), s_(e)(B,A))]. Then, letting r=E[q(r, c_(a), r(c_(a)), c_(g), r(c_(g)), s_(e)(A,B), s_(e)(B,A))] in equation (2), one can determine P(D_(s)|h,f).
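A minimal sketch of the per-SNP likelihood of Equation (2) is shown below. The function name is hypothetical, and the allele ratio r is assumed to have already been corrected for contamination and sequencing error as described above.

```python
from math import comb


def snp_likelihood(n_a: int, n_b: int, r: float) -> float:
    """Binomial probability of n_a A reads and n_b B reads given allele ratio r."""
    return comb(n_a + n_b, n_a) * r ** n_a * (1 - r) ** n_b
```

For deep coverage, working with the logarithm of this quantity avoids floating-point underflow when likelihoods are multiplied across many SNPs.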

Search Algorithm:

In some embodiments, SNPs with allele ratios that seem to be outliers are ignored (such as by ignoring or eliminating SNPs with allele ratios that are at least 2 or 3 standard deviations above or below the mean value). Note that an advantage identified for this approach is that in the presence of higher mosaicism percentage, the variability in the allele ratios may be high, hence this ensures that SNPs will not be trimmed due to mosaicism.

Let F={f₁, . . . , f_(N)} denote the search space for the mosaicism percentage (such as the tumor fraction). One can determine P(D_(s)|h,f) at each SNP s and fϵF, and combine the likelihood over all SNPs.

The algorithm goes over each f for each hypothesis. Using a search method, one concludes that mosaicism exists if there is a range F* of f where the confidence of the deletion or duplication hypothesis is higher than the confidence of the no deletion and no duplication hypotheses. In some embodiments, the maximum likelihood estimate for P(D_(s)|h,f) in F* is determined. If desired, the conditional expectation over fϵF* may be determined. If desired, the confidence for each hypothesis can be determined.
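The search over f for each hypothesis can be sketched as a grid search that sums per-SNP binomial log-likelihoods. This is a simplified illustration under stated assumptions: the SNPs are phased (each labeled "AB" or "BA"), SNPs are treated as independent, contamination and sequencing-error correction are omitted, and every candidate f lies strictly between 0 and 1. All names are hypothetical.

```python
from math import lgamma, log

# Copies of (homolog 1, homolog 2) in affected cells for each hypothesis.
COPIES = {"H11": (1, 1), "H10": (1, 0), "H01": (0, 1), "H21": (2, 1), "H12": (1, 2)}


def ratio(order, hypothesis, f):
    """Expected A-allele ratio from Equation (1)."""
    h1, h2 = COPIES[hypothesis]
    n1, n2 = (1 - f) + f * h1, (1 - f) + f * h2
    return (n1 if order == "AB" else n2) / (n1 + n2)


def log_binom(n_a, n_b, r):
    """Logarithm of the binomial probability in Equation (2)."""
    n = n_a + n_b
    return (lgamma(n + 1) - lgamma(n_a + 1) - lgamma(n_b + 1)
            + n_a * log(r) + n_b * log(1 - r))


def search_mosaicism(snps, grid):
    """Return {hypothesis: (best f, combined log-likelihood)}.

    snps: list of (order, n_a, n_b) tuples; grid: candidate f values in (0, 1).
    """
    best = {}
    for h in COPIES:
        for f in grid:
            ll = sum(log_binom(n_a, n_b, ratio(order, h, f))
                     for order, n_a, n_b in snps)
            if h not in best or ll > best[h][1]:
                best[h] = (f, ll)
    return best
```

Comparing the best combined log-likelihood of a deletion or duplication hypothesis against that of H₁₁ gives one simple way to decide whether a range F* exists.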

Additional Embodiments

In some embodiments, a beta binomial distribution is used instead of a binomial distribution. In some embodiments, a reference chromosome or chromosome segment is used to determine the sample specific parameters of the beta binomial.
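For illustration, a beta-binomial probability can replace the binomial term as sketched below. The parameterization by alpha and beta is the standard one; how these sample-specific parameters are fitted from a reference chromosome is not shown, and the function names are hypothetical.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(n_a: int, n_b: int, alpha: float, beta: float) -> float:
    """Probability of n_a A reads and n_b B reads under a beta-binomial model.

    The mean allele ratio is alpha / (alpha + beta); smaller alpha + beta
    means more overdispersion relative to a binomial.
    """
    n = n_a + n_b
    log_choose = lgamma(n + 1) - lgamma(n_a + 1) - lgamma(n_b + 1)
    return exp(log_choose + log_beta(n_a + alpha, n_b + beta) - log_beta(alpha, beta))
```

One possible convention, assumed here rather than taken from the text, is to set alpha = r·s and beta = (1 − r)·s, where r is the expected allele ratio and s is a dispersion parameter estimated from the reference data.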

Theoretical Performance Using Simulations:

If desired, one can evaluate the theoretical performance of the algorithm by randomly assigning a number of reference reads to a SNP with a given depth of read (DOR). For the normal case, use p=0.5 for the binomial probability parameter, and for deletions or duplications, p is revised accordingly. Exemplary input parameters for each simulation are as follows: (1) number of SNPs S, (2) constant DOR D per SNP, (3) p, and (4) number of experiments.
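A single simulated experiment of this kind might be generated as in the following sketch, which draws the number of reference reads at each SNP from a binomial distribution with a constant depth of read. The function name and the seeding scheme are assumptions made for illustration.

```python
import random
from typing import List


def simulate_reference_reads(num_snps: int, depth: int, p: float, seed: int = 0) -> List[int]:
    """Draw a Binomial(depth, p) reference-read count for each of num_snps SNPs.

    Each count is the number of successes in `depth` independent Bernoulli(p)
    trials, so the function works on any Python 3 version.
    """
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(depth)) for _ in range(num_snps)]
```

For the normal case one would call `simulate_reference_reads(S, D, 0.5)`; for a deletion or duplication, p could be set to the corresponding expected allele ratio, and repeating the call with different seeds gives the requested number of experiments.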

First Simulation Experiment:

This experiment focused on Sϵ{500, 1000}, Dϵ{500, 1000} and pϵ{0%, 1%, 2%, 3%, 4%, 5%}. We performed 1,000 simulation experiments in each setting (hence 24,000 experiments with phase, and 24,000 without phase). We simulated the number of reads from a binomial distribution (if desired, other distributions can be used). The false positive rate (in the case of p=0%) and false negative rate (in the case of p>0%) were determined both with and without phase information. False positive rates are listed in FIG. 26. Note that phase information is very helpful, especially for S=1000, D=1000. For S=500, D=500, however, the algorithm has the highest false positive rates of the conditions tested, both with and without phase. False negative rates are listed in FIG. 27.

Phase information is particularly useful for low mosaicism percentages (≤3%). Without phase information, a high level of false negatives was observed for p=1% because the confidence on deletion is determined by assigning equal chance to H₁₀ and H₀₁, and a small deviation in favor of one hypothesis is not sufficient to compensate for the low likelihood from the other hypothesis. This applies to duplications as well. Note also that the algorithm seems to be more sensitive to depth of read than to number of SNPs. For the results with phase information, we assume that perfect phase information is available for a high number of consecutive heterozygous SNPs. If desired, haplotype information can be obtained by probabilistically combining haplotypes on smaller segments.

Second Simulation Experiment:

This experiment focused on Sϵ{100, 200, 300, 400, 500}, Dϵ{1000, 2000, 3000, 4000, 5000} and pϵ{0%, 1%, 1.5%, 2%, 2.5%, 3%} and 10000 random experiments at each setting. The false positive rate (in the case of p=0%) and false negative rate (in the case of p>0%) were determined both with and without phase information. The false negative rate is below 10% for D≥3000 and N≥200 using haplotype information, whereas without haplotype information the same performance is reached for D=5000 and N≥400 (FIGS. 20A and 20B). The difference in the false negative rate was particularly stark for small mosaicism percentages (FIGS. 21A-25B). For example, when p=1%, a false negative rate below 20% is never reached without haplotype data, whereas with haplotype data it is close to 0% for N≥300 and D≥3000. For p=3%, a 0% false negative rate is observed with haplotype data, while N≥300 and D≥3000 are needed to reach the same performance without haplotype data.

Exemplary Methods for Detecting Deletions and Duplications without Phased Data

In some embodiments, unphased genetic data is used to determine if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment in the genome of an individual (such as in the genome of one or more cells or in cfDNA or cfRNA). In some embodiments, phased genetic data is used but the phasing is ignored. In some embodiments, the sample of DNA or RNA is a mixed sample of cfDNA or cfRNA from the individual that includes cfDNA or cfRNA from two or more genetically different cells. In some embodiments, the method utilizes the magnitude of the difference between the calculated allele ratio and the expected allele ratio for each of the loci.

In some embodiments, the method involves obtaining genetic data at a set of polymorphic loci on the chromosome or chromosome segment in a sample of DNA or RNA from one or more cells from the individual by measuring the quantity of each allele at each locus. In some embodiments, allele ratios are calculated for the loci that are heterozygous in at least one cell from which the sample was derived (such as the loci that are heterozygous in the fetus and/or heterozygous in the mother). In some embodiments, the calculated allele ratio for a particular locus is the measured quantity of one of the alleles divided by the total measured quantity of all the alleles for the locus. In some embodiments, the calculated allele ratio for a particular locus is the measured quantity of one of the alleles (such as the allele on the first homologous chromosome segment) divided by the measured quantity of one or more other alleles (such as the allele on the second homologous chromosome segment) for the locus. The calculated allele ratios and expected allele ratios may be calculated using any of the methods described herein or any standard method (such as any mathematical transformation of the calculated allele ratios or expected allele ratios described herein).

In some embodiments, a test statistic is calculated based on the magnitude of the difference between the calculated allele ratio and the expected allele ratio for each of the loci. In some embodiments, the test statistic Δ is calculated using the following formula

$\Delta = \frac{\sum_{\text{All Loci}} ( \delta_{i} - \mu_{i} )}{\sqrt{\sum_{\text{All Loci}} \sigma_{i}^{2}}}$

wherein δ_(i) is the magnitude of the difference between the calculated allele ratio and the expected allele ratio for the ith locus;

wherein μ_(i) is the mean value of δ_(i); and

wherein σ_(i) is the standard deviation of δ_(i), so that σ_(i)² is its variance.

For example, we can define δ_(i) as follows when the expected allele ratio is 0.5:

$\delta_{i} \triangleq \left| \frac{1}{2} - R_{i} \right|.$

Values for μ_(i) and σ_(i) can be computed using the fact that R_(i) is a binomial random variable. In some embodiments, the standard deviation is assumed to be the same for all the loci. In some embodiments, the average or weighted average value of the standard deviation or an estimate of the standard deviation is used for the value of σ_(i). In some embodiments, the test statistic is assumed to have a normal distribution. For example, the central limit theorem implies that the distribution of Δ converges to a standard normal as the number of loci (such as the number of SNPs T) grows large.

In some embodiments, a set of one or more hypotheses specifying the number of copies of the chromosome or chromosome segment in the genome of one or more of the cells are enumerated. In some embodiments, the hypothesis that is most likely based on the test statistic is selected, thereby determining the number of copies of the chromosome or chromosome segment in the genome of one or more of the cells. In some embodiments, a hypothesis is selected if the probability that the test statistic belongs to a distribution of the test statistic for that hypothesis is above an upper threshold; one or more of the hypotheses is rejected if the probability that the test statistic belongs to the distribution of the test statistic for that hypothesis is below a lower threshold; or a hypothesis is neither selected nor rejected if the probability that the test statistic belongs to the distribution of the test statistic for that hypothesis is between the lower threshold and the upper threshold, or if the probability is not determined with sufficiently high confidence. In some embodiments, an upper and/or lower threshold is determined from an empirical distribution, such as a distribution from training data (such as samples with a known copy number, such as diploid samples or samples known to have a particular deletion or duplication). Such an empirical distribution can be used to select a threshold for a single hypothesis rejection test.
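The select, reject, or no-call rule can be written as a small decision function. This is only a sketch: the function name and return labels are hypothetical, and the thresholds would in practice come from an empirical distribution such as training data.

```python
from typing import Optional


def classify_hypothesis(probability: Optional[float], lower: float, upper: float) -> str:
    """Apply the upper/lower threshold rule to one hypothesis.

    probability is None when it could not be determined with sufficient
    confidence, in which case the hypothesis is neither selected nor rejected.
    """
    if probability is None:
        return "no call"
    if probability > upper:
        return "selected"
    if probability < lower:
        return "rejected"
    return "no call"
```

For example, with illustrative thresholds of 0.01 and 0.99, `classify_hypothesis(0.5, 0.01, 0.99)` leaves the hypothesis undecided.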

Note that the test statistic Δ is independent of S and therefore both can be used independently, if desired.

Exemplary Methods for Detecting Deletions and Duplications Using Allele Distributions or Patterns

This section includes methods for determining if there is an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment. In some embodiments, the method involves enumerating (i) a plurality of hypotheses specifying the number of copies of the chromosome or chromosome segment that are present in the genome of one or more cells (such as cancer cells) of the individual or (ii) a plurality of hypotheses specifying the degree of overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment in the genome of one or more cells of the individual. In some embodiments, the method involves obtaining genetic data from the individual at a plurality of polymorphic loci (such as SNP loci) on the chromosome or chromosome segment. In some embodiments, a probability distribution of the expected genotypes of the individual for each of the hypotheses is created. In some embodiments, a data fit between the obtained genetic data of the individual and the probability distribution of the expected genotypes of the individual is calculated. In some embodiments, one or more hypotheses are ranked according to the data fit, and the hypothesis that is ranked the highest is selected. In some embodiments, a technique or algorithm, such as a search algorithm, is used for one or more of the following steps: calculating the data fit, ranking the hypotheses, or selecting the hypothesis that is ranked the highest. In some embodiments, the data fit is a fit to a beta-binomial distribution or a fit to a binomial distribution. In some embodiments, the technique or algorithm is selected from the group consisting of maximum likelihood estimation, maximum a-posteriori estimation, Bayesian estimation, dynamic estimation (such as dynamic Bayesian estimation), and expectation-maximization estimation. In some embodiments, the method includes applying the technique or algorithm to the obtained genetic data and the expected genetic data.

In some embodiments, the method involves enumerating (i) a plurality of hypotheses specifying the number of copies of the chromosome or chromosome segment that are present in the genome of one or more cells (such as cancer cells) of the individual or (ii) a plurality of hypotheses specifying the degree of overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment in the genome of one or more cells of the individual. In some embodiments, the method involves obtaining genetic data from the individual at a plurality of polymorphic loci (such as SNP loci) on the chromosome or chromosome segment. In some embodiments, the genetic data includes allele counts for the plurality of polymorphic loci. In some embodiments, a joint distribution model is created for the expected allele counts at the plurality of polymorphic loci on the chromosome or chromosome segment for each hypothesis. In some embodiments, a relative probability for one or more of the hypotheses is determined using the joint distribution model and the allele counts measured on the sample, and the hypothesis with the greatest probability is selected.

In some embodiments, the distribution or pattern of alleles (such as the pattern of calculated allele ratios) is used to determine the presence or absence of a CNV, such as a deletion or duplication. If desired, the parental origin of the CNV can be determined based on this pattern. A maternally inherited duplication is an extra copy of a chromosome segment from the mother, and a maternally inherited deletion is the absence of the copy of a chromosome segment from the mother such that the only copy of the chromosome segment that is present is from the father. Exemplary patterns are illustrated in FIGS. 15A-19D and are described further below.

To determine the presence or absence of a deletion of a chromosome segment of interest, the algorithm considers the distribution of sequence counts from each of two possible alleles at a large number of SNPs per chromosome. It is important to note that some embodiments of the algorithm use an approach that does not lend itself to visualization. Thus, for the purposes of illustration, the data is displayed in FIGS. 15A-18 in a simplified fashion as ratios of the two most likely alleles, labeled as A and B, so that the relevant trends can be more readily visualized. This simplified illustration does not take into account some of the possible features of the algorithm. For example, two embodiments for the algorithm that are not possible to illustrate with a method of visualization that displays allele ratios are: 1) the ability to leverage linkage disequilibrium, i.e. the influence that a measurement at one SNP has on the likely identity of a neighboring SNP, and 2) the use of non-Gaussian data models that describe the expected distribution of allele measurements at a SNP given platform characteristics and amplification biases. Also note that a simplified version of the algorithm only considers the two most common alleles at each SNP, ignoring other possible alleles.

Deletions of interest were detected in genomic and maternal blood samples. In some embodiments, the genomic and maternal plasma samples are analyzed using the multiplex-PCR and sequencing method of Example 1. The genomic DNA syndrome samples tested lacked heterozygous SNPs in the targeted regions, confirming the ability of the assays to distinguish monosomy (affected) from disomy (unaffected). Analysis of cfDNA from a maternal blood sample was able to detect 22q11.2 deletion syndrome, Cri-du-Chat deletion syndrome, and Wolf-Hirschhorn deletion syndrome, as well as the other deletion syndromes in FIG. 14 in the fetus.

FIGS. 15A-15C depict data that indicate the presence of two chromosomes when the sample is entirely maternal (no fetal cfDNA present, FIG. 15A), contains a moderate fetal cfDNA fraction of 12% (FIG. 15B), or contains a high fetal cfDNA fraction of 26% (FIG. 15C). The x-axis represents the linear position of the individual polymorphic loci along the chromosome, and the y-axis represents the number of A allele reads as a fraction of the total (A+B) allele reads. Maternal and fetal genotypes are indicated to the right of the plots. The plots are color-coded according to maternal genotype, such that red indicates a maternal genotype of AA, blue indicates a maternal genotype of BB, and green indicates a maternal genotype of AB. Note that the measurements are made on total cfDNA isolated from maternal blood, and the cfDNA includes both maternal and fetal cfDNA; thus, each spot represents the combination of the fetal and maternal DNA contribution for that SNP. Therefore, increasing the proportion of maternal cfDNA from 0% to 100% will gradually shift some spots up or down within the plots, depending on the maternal and fetal genotype.

In all cases, SNPs that are homozygous for the A allele (AA) in both the mother and the fetus are found tightly associated with the upper limit of the plots, as the fraction of A allele reads is high because there should be no B alleles present. Conversely, SNPs that are homozygous for the B allele in both the mother and the fetus are found tightly associated with the lower limit of the plots, as the fraction of A allele reads is low because there should be only B alleles. The spots that are not tightly associated with the upper and lower limits of the plots represent SNPs for which the mother, the fetus, or both are heterozygous; these spots are useful for identifying fetal deletions or duplications, but can also be informative for determining paternal versus maternal inheritance. These spots segregate based on both maternal and fetal genotypes and fetal fraction, and as such the precise position of each individual spot along the y-axis depends on both stoichiometry and fetal fraction. For example, loci where the mother is AA and the fetus is AB are expected to have a different fraction of A allele reads, and thus different positioning along the y-axis, depending on the fetal fraction.
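The dependence of the y-axis position on genotype and fetal fraction can be illustrated with a simple mixture calculation. The sketch below assumes that the fetal fraction weights the maternal and fetal genomes per cell equivalent and that genotypes are written as allele strings (such as "AA", "AB", or a single "A" for a segment with one copy); the function name is hypothetical.

```python
def expected_a_fraction(maternal: str, fetal: str, fetal_fraction: float) -> float:
    """Expected fraction of A allele reads in a maternal/fetal cfDNA mixture.

    maternal, fetal: genotype strings made of "A" and "B"; the string length
    is the copy number at the locus. fetal_fraction is between 0 and 1.
    """
    a_copies = ((1 - fetal_fraction) * maternal.count("A")
                + fetal_fraction * fetal.count("A"))
    total_copies = ((1 - fetal_fraction) * len(maternal)
                    + fetal_fraction * len(fetal))
    return a_copies / total_copies
```

Under these assumptions, a locus where the mother is AA and the fetus is AB sits at 1 − f/2 on the y-axis for fetal fraction f, which is 0.94 at a fetal fraction of 12%.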

FIG. 15A has data for a non-pregnant woman, and thus represents the pattern when the genotype is entirely maternal. This pattern includes “clusters” of spots: a red cluster tightly associated with the top of the plot (SNPs where the maternal genotype is AA), a blue cluster tightly associated with the bottom of the plot (SNPs where the maternal genotype is BB), and a single, centered green cluster (SNPs where the maternal genotype is AB). For FIG. 15B, the contribution of fetal alleles to the fraction of A allele reads shifts the position of some allele spots up or down along the y-axis. For FIG. 15C, the pattern, including two red and two blue peripheral bands and a trio of central green bands, is readily apparent. The three central green bands correspond to SNPs that are heterozygous in the mother, and two “peripheral” bands each at both the top (red) and bottom (blue) of the plots correspond to SNPs that are homozygous in the mother.

Analysis of a 22q11.2 deletion carrier (a mother with this deletion) is shown in FIG. 16A. The deletion carrier does not have heterozygous SNPs in this region since the carrier only has one copy of this region. Thus, this deletion is indicated by the absence of the green AB SNPs. The analysis of a paternally inherited 22q11 deletion in a fetus is shown in FIG. 16B. When the fetus only inherits a single copy of a chromosome segment (in the case of a paternally inherited deletion, the copy present in the fetus comes from the mother), and thus only inherits a single allele for each locus in this segment, heterozygosity of the fetus is not possible. As such, the only possible fetal SNP identities are A or B. Note the absence of internal peripheral bands. For a paternally inherited deletion, the characteristic pattern includes two central green bands that represent SNPs for which the mother is heterozygous, and only has single peripheral red and blue bands that represent SNPs for which the mother is homozygous, and which remain tightly associated with the upper and lower limits of the plots (1 and 0), respectively.

Analysis of a maternally inherited Cri-du-Chat deletion syndrome is shown in FIG. 17. There are two central green bands instead of three green bands, and there are two red and two blue peripheral bands. A maternally inherited deletion (such as a maternal carrier of Duchenne's muscular dystrophy) can also be detected based on the small amount of signal in that region of the deletion in a mixed sample of maternal and fetal DNA (such as a plasma sample) due to both the mother and the fetus having the deletion.

FIG. 18 is a plot of a paternally inherited Wolf-Hirschhorn deletion syndrome, as indicated by the presence of one red and one blue peripheral band.

If desired, similar plots can be generated for a sample from an individual suspected of having a deletion or duplication, such as a CNV associated with cancer. In such plots, the following color coding can be used based on the genotype of cells without the CNV: red indicates a genotype of AA, blue indicates a genotype of BB, and green indicates a genotype of AB. In some embodiments for a deletion, the pattern includes two central green bands that represent SNPs for which the individual is heterozygous (top green band represents AB from cells without the deletion and A from cells with the deletion, and bottom green band represents AB from cells without the deletion and B from cells with the deletion), and only has single peripheral red and blue bands that represent SNPs for which the individual is homozygous, and which remain tightly associated with the upper and lower limits of the plots (1 and 0), respectively. In some embodiments, the separation of the two green bands increases as the fraction of cells, DNA, or RNA with the deletion increases.
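As a worked illustration of why the green bands separate, one can apply the deletion allele ratios of Equation (1) for a fraction f of cells carrying the deletion, assuming no contamination, amplification bias, or sequencing error. The top band lies at 1/(2 − f) and the bottom band at (1 − f)/(2 − f), so their separation is

$\frac{1}{2 - f} - \frac{1 - f}{2 - f} = \frac{f}{2 - f}, \qquad \frac{d}{df}\left( \frac{f}{2 - f} \right) = \frac{2}{(2 - f)^{2}} > 0.$

The separation therefore grows monotonically with f: it is about 0.053 for f = 0.1 and about 0.111 for f = 0.2.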

Exemplary Methods for Identifying and Analyzing Multiple Pregnancies

In some embodiments, any of the methods of the present invention are used to detect the presence of a multiple pregnancy, such as a twin pregnancy, where at least one of the fetuses is genetically different from at least one other fetus. In some embodiments, fraternal twins are identified based on the presence of two fetuses with different alleles, different allele ratios, or different allele distributions at some (or all) of the tested loci. In some embodiments, fraternal twins are identified by determining the expected allele ratio at each locus (such as SNP loci) for two fetuses that may have the same or different fetal fractions in the sample (such as a plasma sample). In some embodiments, the likelihood of a particular pair of fetal fractions (where f1 is the fetal fraction for fetus 1, and f2 is the fetal fraction for fetus 2) is calculated by considering some or all of the possible genotypes of the two fetuses, conditioned on the mother's genotype and genotype population frequencies. The mixture of two fetal and one maternal genotype, combined with the fetal fractions, determines the expected allele ratio at a SNP. For example, if the mother is AA, fetus 1 is AA, and fetus 2 is AB, the overall fraction of B allele at the SNP is one-half of f2. The likelihood calculation asks how well all of the SNPs together match the expected allele ratios based on all of the possible combinations of fetal genotypes. The fetal fraction pair (f1, f2) that best matches the data is selected. It is not necessary to calculate specific genotypes of the fetuses; instead, one can, for example, consider all of the possible genotypes in a statistical combination. In some embodiments, if the method does not distinguish between singleton and identical twins, an ultrasound can be performed to determine whether there is a singleton or identical twin pregnancy. If the ultrasound detects a twin pregnancy it can be assumed that the pregnancy is an identical twin pregnancy because a fraternal twin pregnancy would have been detected based on the SNP analysis discussed above.
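The expected allele ratio for a mixture of one maternal and two fetal genotypes can be sketched as a weighted average, assuming all three genomes are disomic at the locus and that f1 and f2 are fractions of the total cfDNA. The function name is hypothetical.

```python
def expected_b_fraction(mother: str, fetus1: str, fetus2: str,
                        f1: float, f2: float) -> float:
    """Expected overall fraction of B allele at a SNP in a twin pregnancy.

    Genotypes are strings such as "AA", "AB" or "BB"; f1 and f2 are the
    fetal fractions of fetus 1 and fetus 2, with f1 + f2 <= 1.
    """
    def b_share(genotype: str) -> float:
        # Fraction of this genome's alleles at the locus that are B.
        return genotype.count("B") / len(genotype)

    maternal_fraction = 1.0 - f1 - f2
    return (maternal_fraction * b_share(mother)
            + f1 * b_share(fetus1)
            + f2 * b_share(fetus2))
```

With the mother AA, fetus 1 AA, and fetus 2 AB, the result is one-half of f2, as in the example above. A full likelihood calculation would average such expected ratios over the possible fetal genotype combinations rather than fix them.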

In some embodiments, a pregnant mother is known to have a multiple pregnancy (such as a twin pregnancy) based on prior testing, such as an ultrasound. Any of the methods of the present invention can be used to determine whether the multiple pregnancy includes identical or fraternal twins. For example, the measured allele ratios can be compared to what would be expected for identical twins (the same allele ratios as a singleton pregnancy) or for fraternal twins (such as the calculation of allele ratios as described above). Some identical twins are monochorionic twins, which have a risk of twin-to-twin transfusion syndrome. Thus, twins determined to be identical twins using a method of the invention are desirably tested (such as by ultrasound) to determine if they are monochorionic twins, and if so, these twins can be monitored (such as bi-weekly ultrasounds from 16 weeks) for signs of twin-to-twin transfusion syndrome.

In some embodiments, any of the methods of the present invention are used to determine whether any of the fetuses in a multiple pregnancy, such as a twin pregnancy, are aneuploid. Aneuploidy testing for twins begins with the fetal fraction estimate. In some embodiments, the fetal fraction pair (f1, f2) that best matches the data is selected as described above. In some embodiments, a maximum likelihood estimate is performed for the parameter pair (f1, f2) over the range of possible fetal fractions. In some embodiments, the range of f2 is from 0 to f1 because f2 is defined as the smaller fetal fraction. Given a pair (f1, f2), data likelihood is calculated from the allele ratios observed at a set of loci such as SNP loci. In some embodiments, the data likelihood reflects the genotypes of the mother, the father if available, population frequencies, and the resulting probabilities of fetal genotypes. In some embodiments, SNPs are assumed independent. The estimated fetal fraction pair is the one that produces the highest data likelihood. If f2 is 0 then the data is best explained by only one set of fetal genotypes, indicating identical twins, where f1 is the combined fetal fraction. Otherwise f1 and f2 are the estimates of the individual twin fetal fractions. Having established the best estimate of (f1, f2), one can predict the overall fraction of B allele in the plasma for any combination of maternal and fetal genotypes, if desired. It is not necessary to assign individual sequence reads to the individual fetuses. Ploidy testing is performed using another maximum likelihood estimate which compares the data likelihood of two hypotheses. In some embodiments for identical twins, one considers the hypotheses (i) both twins are euploid, and (ii) both twins are trisomic. In some embodiments for fraternal twins, one considers the hypotheses (i) both twins are euploid and (ii) at least one twin is trisomic. The trisomy hypotheses for fraternal twins are based on the lower fetal fraction, since a trisomy in the twin with a higher fetal fraction would also be detected. Ploidy likelihoods are calculated using a method which predicts the expected number of reads at each targeted genome locus conditioned on either the disomy or trisomy hypothesis. There is no requirement for a disomy reference chromosome. The variance model for the expected number of reads takes into account the performance of individual target loci as well as the correlation between loci (see, for example, U.S. Ser. No. 62/008,235, filed Jun. 5, 2014, and U.S. Ser. No. 62/032,785, filed Aug. 4, 2014, which are each hereby incorporated by reference in its entirety). If the smaller twin has fetal fraction f2, our ability to detect a trisomy in that twin is equivalent to our ability to detect a trisomy in a singleton pregnancy at the same fetal fraction. This is because the part of the method that detects the trisomy in some embodiments does not depend on genotypes and does not distinguish between multiple or singleton pregnancy. It simply looks for an increased number of reads in accordance with the determined fetal fraction.

In some embodiments, the method includes detecting the presence of twins based on SNP loci (such as described above). If twins are detected, SNPs are used to determine the fetal fraction of each fetus (f1, f2) such as described above. In some embodiments, samples that have high confidence disomy calls are used to determine the amplification bias on a per-SNP basis. In some embodiments, these samples with high confidence disomy calls are analyzed in the same run as one or more samples of interest. In some embodiments, the amplification bias on a per-SNP basis is used to model the distribution of reads for one or more chromosomes or chromosome segments of interest such as chromosome 21 that are expected for the disomy hypothesis and the trisomy hypothesis given the lower of the two twin fetal fractions. The likelihood or probability of disomy or trisomy is calculated given the two models and the measured quantity of the chromosome or chromosome segment of interest.

In some embodiments, the threshold for a positive aneuploidy call (such as a trisomy call) is set based on the twin with the lower fetal fraction. This way, if the other twin is positive, or if both are positive, the total chromosome representation is definitely above the threshold.
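The reasoning behind using the lower fetal fraction can be illustrated numerically. The sketch below assumes that a trisomic fetus contributes three copies of the chromosome instead of two, so that a trisomy at fetal fraction f raises the chromosome's representation by f/2 relative to an all-disomic sample; the function name is hypothetical and noise is ignored.

```python
def expected_relative_representation(f1: float, f2: float,
                                     twin1_trisomic: bool,
                                     twin2_trisomic: bool) -> float:
    """Expected amount of a chromosome relative to the all-disomic case.

    f1 and f2 are the fetal fractions of the two twins. Each trisomic twin
    adds one extra copy on top of two, that is, half of its fetal fraction.
    """
    extra = (f1 if twin1_trisomic else 0.0) + (f2 if twin2_trisomic else 0.0)
    return 1.0 + extra / 2.0
```

With f1 = 0.10 and f2 = 0.04, a trisomy only in the smaller twin gives 1.02, only in the larger twin gives 1.05, and in both gives 1.07, so a threshold chosen for the smallest of these cases is also exceeded in the other two.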

Exemplary Counting Methods/Quantitative Methods

In some embodiments, one or more counting methods (also referred to as quantitative methods) are used to detect one or more CNVs, such as deletions or duplications of chromosome segments or entire chromosomes. In some embodiments, one or more counting methods are used to determine whether the overrepresentation of the number of copies of the first homologous chromosome segment is due to a duplication of the first homologous chromosome segment or a deletion of the second homologous chromosome segment. In some embodiments, one or more counting methods are used to determine the number of extra copies of a chromosome segment or chromosome that is duplicated (such as whether there are 1, 2, 3, 4, or more extra copies). In some embodiments, one or more counting methods are used to differentiate a sample that has many duplications and a smaller tumor fraction from a sample with fewer duplications and a larger tumor fraction. For example, one or more counting methods may be used to differentiate a sample with four extra chromosome copies and a tumor fraction of 10% from a sample with two extra chromosome copies and a tumor fraction of 20%. Exemplary methods are disclosed in, e.g., U.S. Publication Nos. 2007/0184467; 2013/0172211; and 2012/0003637; U.S. Pat. Nos. 8,467,976; 7,888,017; 8,008,018; 8,296,076; and 8,195,415; U.S. Ser. No. 62/008,235, filed Jun. 5, 2014, and U.S. Ser. No. 62/032,785, filed Aug. 4, 2014, which are each hereby incorporated by reference in its entirety.

In some embodiments, the counting method includes counting the number of DNA sequence-based reads that map to one or more given chromosomes or chromosome segments. Some such methods involve creation of a reference value (cut-off value) for the number of DNA sequence reads mapping to a specific chromosome or chromosome segment, wherein a number of reads in excess of the value is indicative of a specific genetic abnormality.

In some embodiments, the total measured quantity of all the alleles for one or more loci (such as the total amount of a polymorphic or non-polymorphic locus) is compared to a reference amount. In some embodiments, the reference amount is (i) a threshold value or (ii) an expected amount for a particular copy number hypothesis. In some embodiments, the reference amount (for the absence of a CNV) is the total measured quantity of all the alleles for one or more loci for one or more chromosomes or chromosome segments known or expected to not have a deletion or duplication. In some embodiments, the reference amount (for the presence of a CNV) is the total measured quantity of all the alleles for one or more loci for one or more chromosomes or chromosome segments known or expected to have a deletion or duplication. In some embodiments, the reference amount is the total measured quantity of all the alleles for one or more loci for one or more reference chromosomes or chromosome segments. In some embodiments, the reference amount is the mean or median of the values determined for two or more different chromosomes, chromosome segments, or different samples. In some embodiments, random (e.g., massively parallel shotgun sequencing) or targeted sequencing is used to determine the amount of one or more polymorphic or non-polymorphic loci.

In some embodiments utilizing a reference amount, the method includes (a) measuring the amount of genetic material on a chromosome or chromosome segment of interest; (b) comparing the amount from step (a) to a reference amount; and (c) identifying the presence or absence of a deletion or duplication based on the comparison.
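Steps (a) through (c) can be sketched as a simple ratio test. This is only an illustration: the function name `call_cnv` and the `tolerance` parameter (and its default of 3%) are assumptions introduced here, not values taken from the methods described above.

```python
def call_cnv(measured_amount, reference_amount, tolerance=0.03):
    """Compare a measured amount to a reference amount (hypothetical sketch).

    tolerance is an assumed band around a ratio of 1.0 within which
    no deletion or duplication is called.
    """
    ratio = measured_amount / reference_amount
    if ratio > 1.0 + tolerance:
        return "duplication"
    if ratio < 1.0 - tolerance:
        return "deletion"
    return "no CNV detected"


# Example: a region measured at 95 units against a reference of 100 units.
print(call_cnv(95.0, 100.0))
```

In practice the tolerance would depend on the measurement noise and on the fetal or tumor fraction, so a fixed band is a simplification.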

In some embodiments utilizing a reference chromosome or chromosome segment, the method includes sequencing DNA or RNA from a sample to obtain a plurality of sequence tags aligning to target loci. In some embodiments, the sequence tags are of sufficient length to be assigned to a specific target locus (e.g., 15-100 nucleotides in length); the target loci are from a plurality of different chromosomes or chromosome segments that include at least one first chromosome or chromosome segment suspected of having an abnormal distribution in the sample and at least one second chromosome or chromosome segment presumed to be normally distributed in the sample. In some embodiments, the plurality of sequence tags are assigned to their corresponding target loci. In some embodiments, the number of sequence tags aligning to the target loci of the first chromosome or chromosome segment and the number of sequence tags aligning to the target loci of the second chromosome or chromosome segment are determined. In some embodiments, these numbers are compared to determine the presence or absence of an abnormal distribution (such as a deletion or duplication) of the first chromosome or chromosome segment.

In some embodiments, the value of f (such as the fetal fraction or tumor fraction) is used in the CNV determination, such as to compare the observed difference between the amount of two chromosomes or chromosome segments to the difference that would be expected for a particular type of CNV given the value of f (see, e.g., US Publication No 2012/0190020; US Publication No 2012/0190021; US Publication No 2012/0190557; US Publication No 2012/0191358, which are each hereby incorporated by reference in its entirety). For example, the difference in the amount of a chromosome segment that is duplicated in a fetus compared to a disomic reference chromosome segment in a blood sample from a mother carrying the fetus increases as the fetal fraction increases. Additionally, the difference in the amount of a chromosome segment that is duplicated in a tumor compared to a disomic reference chromosome segment increases as the tumor fraction increases. In some embodiments, the method includes comparing the relative frequency of a chromosome or chromosome segment of interest to a reference chromosome or chromosome segment (such as a chromosome or chromosome segment expected or known to be disomic) to the value of f to determine the likelihood of the CNV. For example, the difference in amounts between the first chromosome or chromosome segment and the reference chromosome or chromosome segment can be compared to what would be expected given the value of f for various possible CNVs (such as one or two extra copies of a chromosome segment of interest).
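The comparison against what is expected for each candidate CNV given f can be sketched as follows. The sketch assumes a simple mixture model in which a fraction f of the DNA carries the copy number change on a disomic background; the function names and the candidate list are illustrative assumptions.

```python
def expected_relative_amount(f, copy_change):
    """Expected amount relative to a disomic reference (assumed mixture model).

    f is the fetal or tumor fraction; copy_change is the number of copies
    gained (positive) or lost (negative) in the fraction f of the DNA.
    """
    return 1.0 + f * copy_change / 2.0


def closest_hypothesis(observed, f, copy_changes=(-1, 0, 1, 2)):
    """Return the candidate copy change whose expectation is nearest to observed."""
    return min(
        copy_changes,
        key=lambda c: abs(observed - expected_relative_amount(f, c)),
    )


# The same observed excess maps to different CNVs depending on f.
print(closest_hypothesis(1.10, f=0.10))  # two extra copies fit best at f = 10%
print(closest_hypothesis(1.10, f=0.20))  # one extra copy fits best at f = 20%
```

A nearest-expectation rule is used here only for brevity; a likelihood for each hypothesis, as discussed elsewhere in this section, would account for measurement noise.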

The following prophetic examples illustrate the use of a counting method/quantitative method to differentiate between a duplication of the first homologous chromosome segment and a deletion of the second homologous chromosome segment. If one considers the normal disomic genome of the host to be the baseline, then analysis of a mixture of normal and cancer cells yields the average difference between the baseline and the cancer DNA in the mixture. For example, imagine a case where 10% of the DNA in the sample originated from cells with a deletion over a region of a chromosome that is targeted by the assay. In some embodiments, a quantitative approach shows that the quantity of reads corresponding to that region is expected to be 95% of what is expected for a normal sample. This is because one of the two target chromosomal regions in each of the tumor cells with a deletion of the targeted region is missing, and thus the total amount of DNA mapping to that region is 90% (for the normal cells) plus ½×10% (for the tumor cells)=95%. Alternately in some embodiments, an allelic approach shows that the ratio of alleles at heterozygous loci averaged 19:20. Now imagine a case where 10% of the DNA in the sample originated from cells with a five-fold focal amplification of a region of a chromosome that is targeted by the assay. In some embodiments, a quantitative approach shows that the quantity of reads corresponding to that region is expected to be 125% of what is expected for a normal sample. This is because one of the two target chromosomal regions in each of the tumor cells with a five-fold focal amplification is copied an extra five times over the targeted region, and thus the total amount of DNA mapping to that region is 90% (for the normal cells) plus (2+5)×10%/2 (for the tumor cells)=125%.
Alternately in some embodiments, an allelic approach shows that the ratio of alleles at heterozygous loci averaged 25:20. Note that when using an allelic approach alone, a focal amplification of five-fold over a chromosomal region in a sample with 10% cfDNA may appear the same as a deletion over the same region in a sample with 40% cfDNA; in these two cases, the haplotype that is under-represented in the case of the deletion appears to be the haplotype without a CNV in the case with the focal duplication, and the haplotype without a CNV in the case of the deletion appears to be the over-represented haplotype in the case with the focal duplication. Combining the likelihoods produced by this allelic approach with likelihoods produced by a quantitative approach differentiates between the two possibilities.
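The quantitative arithmetic in the two prophetic examples above (95% for the deletion and 125% for the five-fold focal amplification) can be reproduced with a short sketch. The function name and parameters are illustrative assumptions; only the arithmetic comes from the examples.

```python
def relative_read_quantity(tumor_fraction, tumor_copies, normal_copies=2):
    """Read quantity for a region relative to a normal sample.

    Normal cells contribute (1 - tumor_fraction); tumor cells contribute
    tumor_fraction scaled by their copy number relative to two copies.
    """
    normal_part = 1.0 - tumor_fraction
    tumor_part = tumor_fraction * tumor_copies / normal_copies
    return normal_part + tumor_part


# Deletion: one of the two copies is missing in the tumor cells.
deletion = relative_read_quantity(0.10, tumor_copies=1)
# Amplification: one copy is present an extra five times (2 + 5 copies).
amplification = relative_read_quantity(0.10, tumor_copies=2 + 5)
print(f"deletion: {deletion:.0%}, amplification: {amplification:.0%}")
```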

Exemplary Counting Methods/Quantitative Methods Using Reference Samples

An exemplary quantitative method that uses one or more reference samples is described in U.S. Ser. No. 62/008,235, filed Jun. 5, 2014 and U.S. Ser. No. 62/032,785, filed Aug. 4, 2014, which are each hereby incorporated by reference in its entirety. In some embodiments, one or more reference samples most likely to not have any CNVs on one or more chromosomes or chromosome segments of interest (e.g., a normal sample) are identified by selecting the samples with the highest fraction of tumor DNA, selecting the samples with the z-score closest to zero, selecting the samples where the data fits the hypothesis corresponding to no CNVs with the highest confidence or likelihood, selecting the samples known to be normal, selecting the samples from individuals with the lowest likelihood of having cancer (e.g., having a low age, being a male when screening for breast cancer, having no family history, etc.), selecting the samples with the highest input amount of DNA, selecting the samples with the highest signal to noise ratio, selecting samples based on other criteria believed to be correlated to the likelihood of having cancer, or selecting samples using some combination of criteria. Once the reference set is chosen, one can make the assumption that these cases are disomic, and then estimate the per-SNP bias, that is, the experiment-specific amplification and other processing bias for each locus. Then, one can use this experiment-specific bias estimate to correct the bias in the measurements of the chromosome of interest, such as chromosome 21 loci, and for the other chromosome loci as appropriate, for the samples that are not part of the subset where disomy is assumed for chromosome 21.
Once the biases have been corrected for in these samples of unknown ploidy, the data for these samples can then be analyzed a second time using the same or a different method to determine whether the individuals (such as fetuses) are afflicted with trisomy 21. For example, a quantitative method can be used on the remaining samples of unknown ploidy, and a z-score can be calculated using the corrected measured genetic data on chromosome 21. Alternately, as part of the preliminary estimate of the ploidy state of chromosome 21, a fetal fraction (or tumor fraction for samples from an individual suspected of having cancer) can be calculated. The proportion of corrected reads that are expected in the case of a disomy (the disomy hypothesis), and the proportion of corrected reads that are expected in the case of a trisomy (the trisomy hypothesis) can be calculated for a case with that fetal fraction. Alternately, if the fetal fraction was not measured previously, a set of disomy and trisomy hypotheses can be generated for different fetal fractions. For each case, an expected distribution of the proportion of corrected reads can be calculated given expected statistical variation in the selection and measurement of the various DNA loci. The observed corrected proportion of reads can be compared to the distribution of the expected proportion of corrected reads, and a likelihood ratio can be calculated for the disomy and trisomy hypotheses, for each of the samples of unknown ploidy. The ploidy state associated with the hypothesis with the highest calculated likelihood can be selected as the correct ploidy state.
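The likelihood-ratio comparison of the disomy and trisomy hypotheses can be sketched under strong simplifying assumptions: the expected proportion of corrected reads is modeled as normally distributed with the same standard deviation under both hypotheses, and the trisomy expectation is approximated as the disomy expectation scaled by (1 + f/2). These modeling choices, the function name, and the example numbers are assumptions for illustration only.

```python
def ploidy_call(observed_prop, disomy_prop, fetal_fraction, sd):
    """Return (call, log likelihood ratio of trisomy versus disomy).

    Assumes normal distributions with a shared standard deviation sd, so the
    normalizing constants cancel in the log likelihood ratio.
    """
    # Approximate trisomy expectation: one extra copy in a fraction f of the DNA.
    trisomy_prop = disomy_prop * (1.0 + fetal_fraction / 2.0)
    z_disomy = (observed_prop - disomy_prop) / sd
    z_trisomy = (observed_prop - trisomy_prop) / sd
    log_lr = 0.5 * z_disomy ** 2 - 0.5 * z_trisomy ** 2
    call = "trisomy" if log_lr > 0 else "disomy"
    return call, log_lr


# Hypothetical numbers: 1.30% of corrected reads expected for a disomic chromosome 21.
print(ploidy_call(0.01365, disomy_prop=0.0130, fetal_fraction=0.10, sd=0.0002))
```

The quantity `z_disomy` is the z-score mentioned above; when the fetal fraction is unknown, the same function could be evaluated over a set of candidate fetal fractions.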

In some embodiments, a subset of the samples with a sufficiently low likelihood of having cancer may be selected to act as a control set of samples. The subset can be a fixed number, or it can be a variable number that is based on choosing only those samples that fall below a threshold. The quantitative data from the subset of samples may be combined, averaged, or combined using a weighted average where the weighting is based on the likelihood of the sample being normal. The quantitative data may be used to determine the per-locus bias for the amplification and/or the sequencing of samples in the instant batch of control samples. The per-locus bias may also include data from other batches of samples. The per-locus bias may indicate the relative over- or under-amplification that is observed for that locus compared to other loci, making the assumption that the subset of samples do not contain any CNVs, and that any observed over- or under-amplification is due to amplification and/or sequencing or other bias. The per-locus bias may take into account the GC content of the amplicon. The loci may be grouped into groups of loci for the purpose of calculating a per-locus bias. Once the per-locus bias has been calculated for each locus in the plurality of loci, the sequencing data for one or more of the samples that are not in the subset of the samples, and optionally one or more of the samples that are in the subset of samples, may be corrected by adjusting the quantitative measurements for each locus to remove the effect of the bias at that locus. For example, if SNP 1 was observed, in the subset of patients, to have a depth of read that is twice as great as the average, the adjustment may involve replacing the number of reads corresponding to SNP 1 with a number that is half as great. If the locus in question is a SNP, the adjustment may involve cutting the number of reads corresponding to each of the alleles at that locus in half.
Once the sequencing data for each of the loci in one or more samples has been adjusted, it may be analyzed using a method for the purpose of detecting the presence of a CNV at one or more chromosomal regions.
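The per-locus bias estimate and correction can be sketched as follows, using an unweighted average over the control samples. The function names and the example depths are assumptions; the example is chosen so that the first locus has twice the average depth, matching the SNP 1 illustration above.

```python
def per_locus_bias(control_depths):
    """Estimate per-locus bias from control samples assumed to have no CNVs.

    control_depths is a list of samples, each a list of read depths per locus.
    The bias of a locus is its mean depth divided by the mean over all loci.
    """
    n_samples = len(control_depths)
    n_loci = len(control_depths[0])
    locus_means = [
        sum(sample[i] for sample in control_depths) / n_samples
        for i in range(n_loci)
    ]
    overall_mean = sum(locus_means) / n_loci
    return [m / overall_mean for m in locus_means]


def correct_depths(depths, bias):
    """Remove the estimated bias by dividing each locus depth by its bias."""
    return [d / b for d, b in zip(depths, bias)]


controls = [[200, 50, 50, 100], [200, 50, 50, 100]]  # hypothetical depths
bias = per_locus_bias(controls)          # first locus is at twice the average
corrected = correct_depths([400, 100, 100, 200], bias)
print(bias, corrected)
```

A weighted average (weighting each control by its likelihood of being normal) or a GC-content term could replace the plain mean used here.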

In an example, sample A is a mixture of amplified DNA originating from a mixture of normal and cancerous cells that is analyzed using a quantitative method. The following illustrates exemplary possible data. A region of the q arm on chromosome 22 is found to only have 90% as much DNA mapping to that region as expected; a focal region corresponding to the HER2 gene is found to have 150% as much DNA mapping to that region as expected; and the p-arm of chromosome 5 is found to have 105% as much DNA mapping to it as expected. A clinician may infer that the sample has a deletion of a region on the q arm on chromosome 22, and a duplication of the HER2 gene. The clinician may infer that, since 22q deletions are common in breast cancer and since cells with a deletion of the 22q region on both chromosomes usually do not survive, approximately 20% of the DNA in the sample came from cells with a 22q deletion on one of the two chromosomes. The clinician may also infer that if the DNA from the mixed sample that originated from tumor cells originated from a set of tumor cells that were genetically homogeneous in their HER2 and 22q regions, then the cells contained a five-fold duplication of the HER2 region.
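The clinician's two inferences for sample A can be checked numerically. The sketch below assumes the same mixture model as the prophetic examples (a single-copy deletion on one of two chromosomes, and a homogeneous tumor population); the function names are illustrative assumptions.

```python
def tumor_fraction_from_single_copy_deletion(relative_amount):
    """Tumor fraction implied by a one-copy deletion on a disomic background."""
    return 2.0 * (1.0 - relative_amount)


def extra_copies_from_amplification(relative_amount, tumor_fraction):
    """Extra copies in the tumor cells implied by an over-represented region."""
    return 2.0 * (relative_amount - 1.0) / tumor_fraction


f = tumor_fraction_from_single_copy_deletion(0.90)   # 22q at 90% of expected
extra = extra_copies_from_amplification(1.50, f)     # HER2 at 150% of expected
print(f"tumor fraction ~ {f:.0%}, extra HER2 copies ~ {extra:.1f}")
```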

In an example, Sample A is also analyzed using an allelic method. The following illustrates exemplary possible data. The two haplotypes on the same region on the q arm on chromosome 22 are present in a ratio of 4:5; the two haplotypes in a focal region corresponding to the HER2 gene are present in ratios of 1:2; and the two haplotypes in the p-arm of chromosome 5 are present in ratios of 20:21. All other assayed regions of the genome have no statistically significant excess of either haplotype. A clinician may infer that the sample contains DNA from a tumor with a CNV in the 22q region, the HER2 region, and the 5p arm. Based on the knowledge that 22q deletions are very common in breast cancer, and/or the quantitative analysis showing an under-representation of the amount of DNA mapping to the 22q region of the genome, the clinician may infer the existence of a tumor with a 22q deletion. Based on the knowledge that HER2 amplifications are very common in breast cancer, and/or the quantitative analysis showing an over-representation of the amount of DNA mapping to the HER2 region of the genome, the clinician may infer the existence of a tumor with a HER2 amplification.
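The 4:5 and 1:2 haplotype ratios for the 22q and HER2 regions follow from a 20% tumor fraction under a simple mixture model, as the sketch below illustrates with exact rational arithmetic. The function name and the per-haplotype copy-number parameters are assumptions introduced for this illustration.

```python
from fractions import Fraction


def haplotype_ratio(tumor_fraction, tumor_copies_a, tumor_copies_b):
    """Ratio of haplotype A to haplotype B in a mixed sample.

    Normal cells carry one copy of each haplotype; tumor cells carry the
    given number of copies of each. tumor_fraction should be a Fraction.
    """
    normal = 1 - tumor_fraction
    amount_a = normal + tumor_fraction * tumor_copies_a
    amount_b = normal + tumor_fraction * tumor_copies_b
    return amount_a / amount_b


f = Fraction(1, 5)  # 20% tumor fraction
ratio_22q = haplotype_ratio(f, tumor_copies_a=0, tumor_copies_b=1)   # deletion
ratio_her2 = haplotype_ratio(f, tumor_copies_a=1, tumor_copies_b=6)  # five extra copies
print(ratio_22q, ratio_her2)
```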

Exemplary Reference Chromosomes or Chromosome Segments

In some embodiments, any of the methods described herein are also performed on one or more reference chromosomes or chromosome segments and the results are compared to those for one or more chromosomes or chromosome segments of interest.

In some embodiments, the reference chromosome or chromosome segment is used as a control for what would be expected for the absence of a CNV. In some embodiments, the reference is the same chromosome or chromosome segment from one or more different samples known or expected to not have a deletion or duplication in that chromosome or chromosome segment. In some embodiments, the reference is a different chromosome or chromosome segment from the sample being tested that is expected to be disomic. In some embodiments, the reference is a different segment from one of the chromosomes of interest in the same sample that is being tested. For example, the reference may be one or more segments outside of the region of a potential deletion or duplication. Having a reference on the same chromosome that is being tested avoids variability between different chromosomes, such as differences in metabolism, apoptosis, histones, inactivation, and/or amplification between chromosomes. Analyzing segments without a CNV on the same chromosome as the one being tested can also be used to determine differences in metabolism, apoptosis, histones, inactivation, and/or amplification between homologs, allowing the level of variability between homologs in the absence of a CNV to be determined for comparison to the results from a potential CNV. In some embodiments, the magnitude of the difference between the calculated and expected allele ratios for a potential CNV is greater than the corresponding magnitude for the reference, thereby confirming the presence of a CNV.

In some embodiments, the reference chromosome or chromosome segment is used as a control for what would be expected for the presence of a CNV, such as a particular deletion or duplication of interest. In some embodiments, the reference is the same chromosome or chromosome segment from one or more different samples known or expected to have a deletion or duplication in that chromosome or chromosome segment. In some embodiments, the reference is a different chromosome or chromosome segment from the sample being tested that is known or expected to have a CNV. In some embodiments, the magnitude of the difference between the calculated and expected allele ratios for a potential CNV is similar to (such as not significantly different from) the corresponding magnitude for the reference for the CNV, thereby confirming the presence of a CNV. In some embodiments, the magnitude of the difference between the calculated and expected allele ratios for a potential CNV is less than (such as significantly less than) the corresponding magnitude for the reference for the CNV, thereby confirming the absence of a CNV. In some embodiments, one or more loci for which the genotype of a cancer cell (or DNA or RNA from a cancer cell such as cfDNA or cfRNA) differs from the genotype of a noncancerous cell (or DNA or RNA from a noncancerous cell such as cfDNA or cfRNA) are used to determine the tumor fraction. The tumor fraction can be used to determine whether the overrepresentation of the number of copies of the first homologous chromosome segment is due to a duplication of the first homologous chromosome segment or a deletion of the second homologous chromosome segment.
The tumor fraction can also be used to determine the number of extra copies of a chromosome segment or chromosome that is duplicated (such as whether there are 1, 2, 3, 4, or more extra copies), such as to differentiate a sample with four extra chromosome copies and a tumor fraction of 10% from a sample with two extra chromosome copies and a tumor fraction of 20%. The tumor fraction can also be used to determine how well the observed data fits the expected data for possible CNVs. In some embodiments, the degree of overrepresentation of a CNV is used to select a particular therapy or therapeutic regimen for the individual. For example, some therapeutic agents are only effective for at least four, six, or more copies of a chromosome segment.

In some embodiments, the one or more loci used to determine the tumor fraction are on a reference chromosome or chromosome segment, such as a chromosome or chromosome segment known or expected to be disomic, a chromosome or chromosome segment that is rarely duplicated or deleted in cancer cells in general or in a particular type of cancer that an individual is known to have or is at increased risk of having, or a chromosome or chromosome segment that is unlikely to be aneuploid (such as a segment that is expected to lead to cell death if deleted or duplicated). In some embodiments, any of the methods of the invention are used to confirm that the reference chromosome or chromosome segment is disomic in both the cancer cells and noncancerous cells. In some embodiments, one or more chromosomes or chromosome segments for which the confidence for a disomy call is high are used.

Exemplary loci that can be used to determine the tumor fraction include polymorphisms or mutations (such as SNPs) in a cancer cell (or DNA or RNA such as cfDNA or cfRNA from a cancer cell) that aren't present in a noncancerous cell (or DNA or RNA from a noncancerous cell) in the individual. In some embodiments, the tumor fraction is determined by identifying those polymorphic loci where a cancer cell (or DNA or RNA from a cancer cell) has an allele that is absent in noncancerous cells (or DNA or RNA from a noncancerous cell) in a sample (such as a plasma sample or tumor biopsy) from an individual; and using the amount of the allele unique to the cancer cell at one or more of the identified polymorphic loci to determine the tumor fraction in the sample. In some embodiments, a noncancerous cell is homozygous for a first allele at the polymorphic locus, and a cancer cell is (i) heterozygous for the first allele and a second allele or (ii) homozygous for a second allele at the polymorphic locus. In some embodiments, a noncancerous cell is heterozygous for a first allele and a second allele at the polymorphic locus, and a cancer cell has one or two copies of a third allele at the polymorphic locus. In some embodiments, the cancer cells are assumed or known to only have one copy of the allele that is not present in the noncancerous cells. For example, if the genotype of the noncancerous cells is AA and the cancer cells is AB and 5% of the signal at that locus in a sample is from the B allele and 95% is from the A allele, then the tumor fraction of the sample is 10%. In some embodiments, the cancer cells are assumed or known to have two copies of the allele that is not present in the noncancerous cells. For example, if the genotype of the noncancerous cells is AA and the cancer cells is BB and 5% of the signal at that locus in a sample is from the B allele and 95% is from the A allele, the tumor fraction of the sample is 5%.
In some embodiments, multiple loci for which the cancer cells have an allele not in the noncancerous cells are analyzed to determine which of the loci in the cancer cells are heterozygous and which are homozygous. For example, for loci in which the noncancerous cells are AA, if the signal from the B allele is ˜5% at some loci and ˜10% at some loci, then the cancer cells are assumed to be heterozygous at loci with ˜5% B allele, and homozygous at loci with ˜10% B allele (indicating the tumor fraction is ˜10%).
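The tumor fraction examples above (noncancerous cells AA, cancer cells AB or BB) reduce to a short calculation, sketched below. The function names are illustrative assumptions, and the zygosity rule is a simple nearest-expectation comparison rather than a statistical test.

```python
def tumor_fraction_from_b_allele(b_fraction, tumor_b_copies):
    """Tumor fraction when noncancerous cells are AA and the tumor carries B.

    tumor_b_copies is 1 for an AB tumor genotype and 2 for a BB tumor genotype.
    """
    return b_fraction * 2 / tumor_b_copies


def infer_tumor_zygosity(b_fraction, tumor_fraction):
    """Classify a locus by whether its B signal is nearer f/2 (AB) or f (BB)."""
    het_expected = tumor_fraction / 2
    hom_expected = tumor_fraction
    if abs(b_fraction - het_expected) < abs(b_fraction - hom_expected):
        return "heterozygous"
    return "homozygous"


print(tumor_fraction_from_b_allele(0.05, tumor_b_copies=1))  # AB tumor
print(tumor_fraction_from_b_allele(0.05, tumor_b_copies=2))  # BB tumor
print(infer_tumor_zygosity(0.05, 0.10), infer_tumor_zygosity(0.10, 0.10))
```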

Exemplary loci that can be used to determine the tumor fraction include loci for which a cancer cell and noncancerous cell have one allele in common (such as loci in which the cancer cell is AB and the noncancerous cell is BB, or the cancer cell is BB and the noncancerous cell is AB). The amount of A signal, the amount of B signal, or the ratio of A to B signal in a mixed sample (containing DNA or RNA from a cancer cell and a noncancerous cell) is compared to the corresponding value for (i) a sample containing DNA or RNA from only cancer cells or (ii) a sample containing DNA or RNA from only noncancerous cells. The difference in values is used to determine the tumor fraction of the mixed sample.

In some embodiments, loci that can be used to determine the tumor fraction are selected based on the genotype of (i) a sample containing DNA or RNA from only cancer cells, and/or (ii) a sample containing DNA or RNA from only noncancerous cells. In some embodiments, the loci are selected based on analysis of the mixed sample, such as loci for which the absolute or relative amounts of each allele differ from what would be expected if both the cancer and noncancerous cells have the same genotype at a particular locus. For example, if the cancer and noncancerous cells have the same genotype, the loci would be expected to produce 0% B signal if all the cells are AA, 50% B signal if all the cells are AB, or 100% B signal if all the cells are BB. Other values for the B signal indicate that the genotypes of the cancer and noncancerous cells are different at that locus and thus that locus can be used to determine the tumor fraction.
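Selecting loci from the mixed sample alone can be sketched as a filter on the B signal. The `tolerance` parameter, which stands in for measurement noise around the 0%, 50%, and 100% expectations, and the locus names are assumptions introduced for this illustration.

```python
def is_informative(b_fraction, tolerance=0.02):
    """True if the B signal differs from every same-genotype expectation.

    Same-genotype expectations are 0% (all AA), 50% (all AB), and 100% (all BB).
    tolerance is an assumed noise band around each expectation.
    """
    return all(abs(b_fraction - expected) > tolerance for expected in (0.0, 0.5, 1.0))


b_signals = {"locus_1": 0.05, "locus_2": 0.50, "locus_3": 0.995, "locus_4": 0.45}
informative = [name for name, b in b_signals.items() if is_informative(b)]
print(informative)
```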

In some embodiments, the tumor fraction calculated based on the alleles at one or more loci is compared to the tumor fraction calculated using one or more of the counting methods disclosed herein.

Exemplary Methods for Detecting a Phenotype or Analyzing Multiple Mutations

In some embodiments, the method includes analyzing a sample for a set of mutations associated with a disease or disorder (such as cancer) or an increased risk for a disease or disorder. There are strong correlations between events within classes (such as M or C cancer classes) which can be used to improve the signal to noise ratio of a method and classify tumors into distinct clinical subsets. For example, borderline results for a few mutations (such as a few CNVs) on one or more chromosomes or chromosome segments considered jointly may be a very strong signal. In some embodiments, determining the presence or absence of multiple polymorphisms or mutations of interest (such as 2, 3, 4, 5, 8, 10, 12, 15, or more) increases the sensitivity and/or specificity of the determination of the presence or absence of a disease or disorder such as cancer, or an increased risk for a disease or disorder such as cancer. In some embodiments, the correlation between events across multiple chromosomes is used to more powerfully look at a signal compared to looking at each of them individually. The design of the method itself can be optimized to best categorize tumors. This may be incredibly useful for early detection and screening (vis-a-vis recurrence, where sensitivity to one particular mutation/CNV may be paramount). In some embodiments, the events are not always correlated but have a probability of being correlated. In some embodiments, a matrix estimation formulation with a noise covariance matrix that has off-diagonal terms is used.

In some embodiments, the invention features a method for detecting a phenotype (such as a cancer phenotype) in an individual, wherein the phenotype is defined by the presence of at least one of a set of mutations. In some embodiments, the method includes obtaining DNA or RNA measurements for a sample of DNA or RNA from one or more cells from the individual, wherein one or more of the cells is suspected of having the phenotype; and analyzing the DNA or RNA measurements to determine, for each of the mutations in the set of mutations, the likelihood that at least one of the cells has that mutation. In some embodiments, the method includes determining that the individual has the phenotype if either (i) for at least one of the mutations, the likelihood that at least one of the cells contains that mutation is greater than a threshold, or (ii) for at least one of the mutations, the likelihood that at least one of the cells has that mutation is less than the threshold, and for a plurality of the mutations, the combined likelihood that at least one of the cells has at least one of the mutations is greater than the threshold. In some embodiments, one or more cells have a subset or all of the mutations in the set of mutations. In some embodiments, the subset of mutations is associated with cancer or an increased risk for cancer. In some embodiments, the set of mutations includes a subset or all of the mutations in the M class of cancer mutations (Ciriello, Nat Genet. 45(10):1127-1133, 2013, doi:10.1038/ng.2762, which is hereby incorporated by reference in its entirety). In some embodiments, the set of mutations includes a subset or all of the mutations in the C class of cancer mutations (Ciriello, supra). In some embodiments, the sample includes cell-free DNA or RNA. In some embodiments, the DNA or RNA measurements include measurements (such as the quantity of each allele at each locus) at a set of polymorphic loci on one or more chromosomes or chromosome segments of interest.
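The two-branch decision rule above can be sketched as follows. The combined likelihood is computed here as 1 minus the product of (1 − p) over the mutations, which assumes the per-mutation likelihoods are independent; that independence is an assumption of this sketch (the text above notes that events may be correlated), and the function name is illustrative. The sketch requires Python 3.8 or later for `math.prod`.

```python
import math


def has_phenotype(likelihoods, threshold):
    """Apply the two-branch rule to per-mutation likelihoods.

    Branch (i): any single mutation's likelihood exceeds the threshold.
    Branch (ii): no single mutation does, but the combined likelihood of at
    least one mutation being present exceeds it (independence assumed).
    """
    if any(p > threshold for p in likelihoods):
        return True
    combined = 1.0 - math.prod(1.0 - p for p in likelihoods)
    return combined > threshold


# Three borderline results that are jointly a strong signal.
print(has_phenotype([0.6, 0.7, 0.5], threshold=0.9))
```

With correlated events, the combined likelihood would need a joint model, such as the covariance-based formulation mentioned earlier, rather than a plain product.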

Exemplary Methods for Paternity Testing or Genetic Relatedness Testing

The methods of the invention can be used to improve the accuracy of paternity testing or other genetic relatedness testing (see, e.g., U.S. Publication No. 2012/0122701, filed Dec. 22, 2011, which is hereby incorporated by reference in its entirety). For example, the multiplex PCR method can allow thousands of polymorphic loci (such as SNPs) to be analyzed for use in the PARENTAL SUPPORT algorithm described herein to determine whether an alleged father is the biological father of a fetus. In some embodiments, the invention features a method for establishing whether an alleged father is the biological father of a fetus that is gestating in a pregnant mother. In some embodiments, the method involves obtaining phased genetic data for the alleged father (such as by using another of the methods described herein for phasing genetic data), wherein the phased genetic data comprises the identity of the allele present for each locus in a set of polymorphic loci on a first homologous chromosome segment and a second homologous chromosome segment in the alleged father. In some embodiments, the method involves obtaining genetic data at the set of polymorphic loci on the chromosome or chromosome segment in a mixed sample of DNA comprising fetal DNA and maternal DNA from the mother of the fetus by measuring the quantity of each allele at each locus. In some embodiments, the method involves calculating, on a computer, expected genetic data for the mixed sample of DNA from the phased genetic data for the alleged father; determining, on a computer, the probability that the alleged father is the biological father of the fetus by comparing the obtained genetic data made on the mixed sample of DNA to the expected genetic data for the mixed sample of DNA; and establishing whether the alleged father is the biological father of the fetus using the determined probability that the alleged father is the biological father of the fetus.
In some embodiments, the method involves obtaining phased genetic data for the biological mother of the fetus (such as by using another of the methods described herein for phasing genetic data), wherein the phased genetic data comprises the identity of the allele present for each locus in a set of polymorphic loci on a first homologous chromosome segment and a second homologous chromosome segment in the mother. In some embodiments, the method involves obtaining phased genetic data for the fetus (such as by using another of the methods described herein for phasing genetic data), wherein the phased genetic data comprises the identity of the allele present for each locus in a set of polymorphic loci on a first homologous chromosome segment and a second homologous chromosome segment in the fetus. In some embodiments, the method involves calculating, on a computer, expected genetic data for the mixed sample of DNA using the phased genetic data for the alleged father and using the phased genetic data for the mother and/or the phased genetic data for the fetus.

In some embodiments, the invention features a method for establishing whether an alleged father is the biological father of a fetus that is gestating in a pregnant mother. In some embodiments, the method involves obtaining phased genetic data for the alleged father (such as by using another of the methods described herein for phasing genetic data), wherein the phased genetic data comprises the identity of the allele present for each locus in a set of polymorphic loci on a first homologous chromosome segment and a second homologous chromosome segment in the alleged father. In some embodiments, the method involves obtaining genetic data at the set of polymorphic loci on the chromosome or chromosome segment in a mixed sample of DNA comprising fetal DNA and maternal DNA from the mother of the fetus by measuring the quantity of each allele at each locus. In some embodiments, the method involves identifying (i) alleles that are present in the fetal DNA but are absent in the maternal DNA at polymorphic loci, and/or identifying (ii) alleles that are absent in the fetal DNA and the maternal DNA at polymorphic loci. In some embodiments, the method involves determining, on a computer, the probability that the alleged father is the biological father of the fetus; wherein the determination comprises: (1) comparing (i) the alleles that are present in the fetal DNA but are absent in the maternal DNA at polymorphic loci to (ii) the alleles at the corresponding polymorphic loci in the genetic material from the alleged father, and/or (2) comparing (i) the alleles that are absent in the fetal DNA and the maternal DNA at polymorphic loci to (ii) the alleles at the corresponding polymorphic loci in the genetic material from the alleged father; and establishing whether the alleged father is the biological father of the fetus using the determined probability that the alleged father is the biological father of the fetus.

In some embodiments, a method described above for determining whether an alleged father is the biological father of the fetus is used to determine if an alleged relative (such as a grandparent, sibling, aunt, or uncle) of a fetus is an actual biological relative of the fetus (such as by using genetic data of the alleged relative instead of genetic data of the alleged father).

Exemplary Combinations of Methods

To increase the accuracy of the results, two or more methods (such as any of the methods of the invention or any known method) for detecting the presence or absence of a CNV are performed. In some embodiments, one or more methods for analyzing a factor (such as any of the methods described herein or any known method) indicative of the presence or absence of a disease or disorder or an increased risk for a disease or disorder are performed.

In some embodiments, standard mathematical techniques are used to calculate the covariance and/or correlation between two or more methods. Standard mathematical techniques may also be used to determine the combined probability of a particular hypothesis based on two or more tests. Exemplary techniques include meta-analysis, Fisher's combined probability test for independent tests, Brown's method for combining dependent p-values with known covariance, and Kost's method for combining dependent p-values with unknown covariance. In cases where the likelihoods are determined by a first method in a way that is orthogonal, or unrelated, to the way in which a likelihood is determined for a second method, combining the likelihoods is straightforward and can be done by multiplication and normalization, or by using a formula such as:

$$R_{\mathrm{comb}} = \frac{R_1 R_2}{R_1 R_2 + (1 - R_1)(1 - R_2)}$$

$R_{\mathrm{comb}}$ is the combined likelihood, and $R_1$ and $R_2$ are the individual likelihoods. For example, if the likelihood of trisomy from method 1 is 90%, and the likelihood of trisomy from method 2 is 95%, then combining the outputs from the two methods allows the clinician to conclude that the fetus is trisomic with a likelihood of (0.90)(0.95)/[(0.90)(0.95)+(1−0.90)(1−0.95)]=99.42%. In cases where the first and the second methods are not orthogonal, that is, where there is a correlation between the two methods, the likelihoods can still be combined.
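
As a minimal sketch of the formula above, the following Python function combines two likelihoods under the stated assumption that the two methods are orthogonal. The function name `combine_likelihoods` and its input checks are illustrative assumptions and are not part of the described methods; the example call reproduces the 90% and 95% case worked in the text.

```python
def combine_likelihoods(r1: float, r2: float) -> float:
    """Combine two likelihoods of the same hypothesis.

    Implements R_comb = R1*R2 / [R1*R2 + (1 - R1)*(1 - R2)].
    Assumption: the two methods are orthogonal (uncorrelated).
    """
    for r in (r1, r2):
        if not 0.0 <= r <= 1.0:
            raise ValueError("likelihoods must lie in [0, 1]")
    numerator = r1 * r2
    denominator = numerator + (1.0 - r1) * (1.0 - r2)
    if denominator == 0.0:
        # Only reached when one method reports 0 and the other reports 1.
        raise ValueError("the two likelihoods are contradictory")
    return numerator / denominator


# Worked example from the text: method 1 gives 90%, method 2 gives 95%.
combined = combine_likelihoods(0.90, 0.95)
print(f"{combined:.2%}")  # 99.42%
```

For correlated methods this simple product form does not apply; a technique that accounts for the covariance would be needed instead.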

Exemplary methods of analyzing multiple factors or variables are disclosed in U.S. Pat. No. 8,024,128 issued on Sep. 20, 2011; U.S. Publication No. 2007/0027636, filed Jul. 31, 2006; and U.S. Publication No. 2007/0178501, filed Dec. 6, 2006, which are each hereby incorporated by reference in its entirety.

In various embodiments, the combined probability of a particular hypothesis or diagnosis is greater than 80, 85, 90, 92, 94, 96, 98, 99, or 99.9%, or is greater than some other threshold value.

Limit of Detection

In some embodiments, a limit of detection of a mutation (such as an SNV or CNV) of a method of the invention is less than or equal to 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, or 0.005%. In some embodiments, a limit of detection of a mutation (such as an SNV or CNV) of a method of the invention is between 15 to 0.005%, such as between 10 to 0.005%, 10 to 0.01%, 10 to 0.1%, 5 to 0.005%, 5 to 0.01%, 5 to 0.1%, 1 to 0.005%, 1 to 0.01%, 1 to 0.1%, 0.5 to 0.005%, 0.5 to 0.01%, 0.5 to 0.1%, or 0.1 to 0.01%, inclusive. In some embodiments, a limit of detection is such that a mutation (such as an SNV or CNV) that is present in less than or equal to 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, or 0.005% of the DNA or RNA molecules with that locus in a sample (such as a sample of cfDNA or cfRNA) is detected (or is capable of being detected). For example, the mutation can be detected even if less than or equal to 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, or 0.005% of the DNA or RNA molecules that have that locus have that mutation in the locus (instead of, for example, a wild-type or non-mutated version of the locus or a different mutation at that locus). In some embodiments, a limit of detection is such that a mutation (such as an SNV or CNV) that is present in less than or equal to 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, or 0.005% of the DNA or RNA molecules in a sample (such as a sample of cfDNA or cfRNA) is detected (or is capable of being detected). In some embodiments in which the CNV is a deletion, the deletion can be detected even if it is only present in less than or equal to 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, or 0.005% of the DNA or RNA molecules that have a region of interest that may or may not contain the deletion in a sample. In some embodiments in which the CNV is a deletion, the deletion can be detected even if it is only present in less than or equal to 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, or 0.005% of the DNA or RNA molecules in a sample.
In some embodiments in which the CNV is a duplication, the duplication can be detected even if the extra duplicated DNA or RNA that is present is less than or equal to 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, or 0.005% of the DNA or RNA molecules that have a region of interest that may or may not be duplicated in a sample. In some embodiments in which the CNV is a duplication, the duplication can be detected even if the extra duplicated DNA or RNA that is present is less than or equal to 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, or 0.005% of the DNA or RNA molecules in a sample. Example 6 provides exemplary methods for calculating the limit of detection. In some embodiments, the “LOD-zs5.0-mr5” method of Example 6 is used.

Exemplary Samples

In some embodiments of any of the aspects of the invention, the sample includes cellular and/or extracellular genetic material from cells suspected of having a deletion or duplication, such as cells suspected of being cancerous. In some embodiments, the sample comprises any tissue or bodily fluid suspected of containing cells, DNA, or RNA having a deletion or duplication, such as cancer cells, DNA, or RNA. The genetic measurements used as part of these methods can be made on any sample comprising DNA or RNA, for example but not limited to, tissue, blood, serum, plasma, urine, hair, tears, saliva, skin, fingernails, feces, bile, lymph, cervical mucus, semen, or other cells or materials comprising nucleic acids. Samples may include any cell type, or DNA or RNA from any cell type (such as cells from any organ or tissue suspected of being cancerous, or neurons). In some embodiments, the sample includes nuclear and/or mitochondrial DNA. In some embodiments, the sample is from any of the target individuals disclosed herein. In some embodiments, the target individual is a born individual, a gestating fetus, a non-gestating fetus such as a products of conception sample, an embryo, or any other individual.

Exemplary samples include those containing cfDNA or cfRNA. In some embodiments, cfDNA is available for analysis without requiring the step of lysing cells. Cell-free DNA may be obtained from a variety of tissues, such as tissues that are in liquid form, e.g., blood, plasma, lymph, ascites fluid, or cerebrospinal fluid. In some cases, cfDNA comprises DNA derived from fetal cells. In some cases, cfDNA comprises DNA derived from both fetal and maternal cells. In some cases, the cfDNA is isolated from plasma that has been isolated from whole blood that has been centrifuged to remove cellular material. The cfDNA may be a mixture of DNA derived from target cells (such as cancer cells) and non-target cells (such as non-cancer cells).

In some embodiments, the sample contains or is suspected to contain a mixture of DNA (or RNA), such as a mixture of cancer DNA (or RNA) and noncancerous DNA (or RNA). In some embodiments, at least 0.5, 1, 3, 5, 7, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 92, 94, 95, 96, 98, 99, or 100% of the cells in the sample are cancer cells. In some embodiments, at least 0.5, 1, 3, 5, 7, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 92, 94, 95, 96, 98, 99, or 100% of the DNA (such as cfDNA) or RNA (such as cfRNA) in the sample is from cancer cell(s). In various embodiments, the percent of cells in the sample that are cancerous cells is between 0.5 to 99%, such as between 1 to 95%, 5 to 95%, 10 to 90%, 5 to 70%, 10 to 70%, 20 to 90%, or 20 to 70%, inclusive. In some embodiments, the sample is enriched for cancer cells or for DNA or RNA from cancer cells. In some embodiments in which the sample is enriched for cancer cells, at least 0.5, 1, 2, 3, 4, 5, 6, 7, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 92, 94, 95, 96, 98, 99, or 100% of the cells in the enriched sample are cancer cells. In some embodiments in which the sample is enriched for DNA or RNA from cancer cells, at least 0.5, 1, 2, 3, 4, 5, 6, 7, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 92, 94, 95, 96, 98, 99, or 100% of the DNA or RNA in the enriched sample is from cancer cell(s). In some embodiments, cell sorting (such as Fluorescence-Activated Cell Sorting (FACS)) is used to enrich for cancer cells (Barteneva et al., Biochim Biophys Acta. 1836(1):105-22, August 2013, doi:10.1016/j.bbcan.2013.02.004, Epub 2013 February 24; and Ibrahim et al., Adv Biochem Eng Biotechnol. 106:19-39, 2007, which are each hereby incorporated by reference in its entirety).

In some embodiments of any of the aspects of the invention, the sample comprises any tissue suspected of being at least partially of fetal origin. In some embodiments, the sample includes cellular and/or extracellular genetic material from the fetus, contaminating cellular and/or extracellular genetic material (such as genetic material from the mother of the fetus), or combinations thereof. In some embodiments, the sample comprises cellular genetic material from the fetus, contaminating cellular genetic material, or combinations thereof.

In some embodiments, the sample is from a gestating fetus. In some embodiments, the sample is from a non-gestating fetus, such as a products of conception sample or a sample from any fetal tissue after fetal demise. In some embodiments, the sample is a maternal whole blood sample, cells isolated from a maternal blood sample, maternal plasma sample, maternal serum sample, amniocentesis sample, placental tissue sample (e.g., chorionic villus, decidua, or placental membrane), cervical mucus sample, or other sample from a fetus. In some embodiments, at least 3, 5, 7, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 92, 94, 95, 96, 98, 99, or 100% of the cells in the sample are maternal cells. In various embodiments, the percent of cells in the sample that are maternal cells is between 5 to 99%, such as between 10 to 95%, 20 to 95%, 30 to 90%, 30 to 70%, 40 to 90%, 40 to 70%, 50 to 90%, or 50 to 80%, inclusive.

In some embodiments, the sample is enriched for fetal cells. In some embodiments in which the sample is enriched for fetal cells, at least 0.5, 1, 2, 3, 4, 5, 6, 7% or more of the cells in the enriched sample are fetal cells. In some embodiments, the percent of cells in the sample that are fetal cells is between 0.5 to 100%, such as between 1 to 99%, 5 to 95%, 10 to 95%, 20 to 90%, or 30 to 70%, inclusive. In some embodiments, the sample is enriched for fetal DNA. In some embodiments in which the sample is enriched for fetal DNA, at least 0.5, 1, 2, 3, 4, 5, 6, 7% or more of the DNA in the enriched sample is fetal DNA. In some embodiments, the percent of DNA in the sample that is fetal DNA is between 0.5 to 100%, such as between 1 to 99%, 5 to 95%, 10 to 95%, 20 to 90%, or 30 to 70%, inclusive.

In some embodiments, the sample includes a single cell or includes DNA and/or RNA from a single cell. In some embodiments, multiple individual cells (e.g., at least 5, 10, 20, 30, 40, or 50 cells from the same subject or from different subjects) are analyzed in parallel. In some embodiments, cells from multiple samples from the same individual are combined, which reduces the amount of work compared to analyzing the samples separately. Combining multiple samples can also allow multiple tissues to be tested for cancer simultaneously (which can be used to provide more thorough screening for cancer or to determine whether cancer may have metastasized to other tissues).

In some embodiments, the sample contains a single cell or a small number of cells, such as 2, 3, 5, 6, 7, 8, 9, or 10 cells. In some embodiments, the sample has between 1 to 100, 100 to 500, or 500 to 1,000 cells, inclusive. In some embodiments, the sample contains 1 to 10 picograms, 10 to 100 picograms, 100 picograms to 1 nanogram, 1 to 10 nanograms, 10 to 100 nanograms, or 100 nanograms to 1 microgram of RNA and/or DNA, inclusive.

In some embodiments, the sample is embedded in paraffin. In some embodiments, the sample is preserved with a preservative such as formaldehyde and optionally encased in paraffin, which may cause cross-linking of the DNA such that less of it is available for PCR. In some embodiments, the sample is a formaldehyde-fixed, paraffin-embedded (FFPE) sample. In some embodiments, the sample is a fresh sample (such as a sample obtained within 1 or 2 days of analysis). In some embodiments, the sample is frozen prior to analysis. In some embodiments, the sample is a historical sample.

These samples can be used in any of the methods of the invention.

Exemplary Sample Preparation Methods

In some embodiments, the method includes isolating or purifying the DNA and/or RNA. There are a number of standard procedures known in the art to accomplish such an end. In some embodiments, the sample may be centrifuged to separate various layers. In some embodiments, the DNA or RNA may be isolated using filtration. In some embodiments, the preparation of the DNA or RNA may involve amplification, separation, purification by chromatography, liquid-liquid separation, isolation, preferential enrichment, preferential amplification, targeted amplification, or any of a number of other techniques either known in the art or described herein. In some embodiments for the isolation of DNA, RNase is used to degrade RNA. In some embodiments for the isolation of RNA, DNase (such as DNase I from Invitrogen, Carlsbad, Calif., USA) is used to degrade DNA. In some embodiments, an RNeasy mini kit (Qiagen) is used to isolate RNA according to the manufacturer's protocol. In some embodiments, small RNA molecules are isolated using the mirVana PARIS kit (Ambion, Austin, Tex., USA) according to the manufacturer's protocol (Gu et al., J. Neurochem. 122:641-649, 2012, which is hereby incorporated by reference in its entirety). The concentration and purity of RNA may optionally be determined using Nanovue (GE Healthcare, Piscataway, N.J., USA), and RNA integrity may optionally be measured by use of the 2100 Bioanalyzer (Agilent Technologies, Santa Clara, Calif., USA) (Gu et al., J. Neurochem. 122:641-649, 2012, which is hereby incorporated by reference in its entirety). In some embodiments, TRIZOL or RNAlater (Ambion) is used to stabilize RNA during storage.

In some embodiments, universal tagged adaptors are added to make a library. Prior to ligation, sample DNA may be blunt ended, and then a single adenosine base is added to the 3-prime end. Prior to ligation the DNA may be cleaved using a restriction enzyme or some other cleavage method. During ligation the 3-prime adenosine of the sample fragments and the complementary 3-prime thymine overhang of the adaptor can enhance ligation efficiency. In some embodiments, adaptor ligation is performed using the ligation kit found in the AGILENT SURESELECT kit. In some embodiments, the library is amplified using universal primers. In an embodiment, the amplified library is fractionated by size separation or by using products such as AGENCOURT AMPURE beads or other similar methods. In some embodiments, PCR amplification is used to amplify target loci. In some embodiments, the amplified DNA is sequenced (such as sequencing using an ILLUMINA IIGAX or HiSeq sequencer). In some embodiments, the amplified DNA is sequenced from each end of the amplified DNA to reduce sequencing errors. If there is a sequence error in a particular base when sequencing from one end of the amplified DNA, there is less likely to be a sequence error in the complementary base when sequencing from the other side of the amplified DNA (compared to sequencing multiple times from the same end of the amplified DNA).

In some embodiments, whole genome amplification (WGA) is used to amplify a nucleic acid sample. There are a number of methods available for WGA: ligation-mediated PCR (LM-PCR), degenerate oligonucleotide primer PCR (DOP-PCR), and multiple displacement amplification (MDA). In LM-PCR, short DNA sequences called adapters are ligated to blunt ends of DNA. These adapters contain universal amplification sequences, which are used to amplify the DNA by PCR. In DOP-PCR, random primers that also contain universal amplification sequences are used in a first round of annealing and PCR. Then, a second round of PCR is used to amplify the sequences further with the universal primer sequences. MDA uses the phi-29 polymerase, which is a highly processive and non-specific enzyme that replicates DNA and has been used for single-cell analysis. In some embodiments, WGA is not performed.

In some embodiments, selective amplification or enrichment is used to amplify or enrich target loci. In some embodiments, the amplification and/or selective enrichment technique may involve PCR such as ligation mediated PCR, fragment capture by hybridization, Molecular Inversion Probes, or other circularizing probes. In some embodiments, real-time quantitative PCR (RT-qPCR), digital PCR, emulsion PCR, or single allele base extension reaction followed by mass spectrometry is used (Hung et al., J Clin Pathol 62:308-313, 2009, which is hereby incorporated by reference in its entirety). In some embodiments, capture by hybridization with hybrid capture probes is used to preferentially enrich the DNA. In some embodiments, methods for amplification or selective enrichment may involve using probes where, upon correct hybridization to the target sequence, the 3-prime end or 5-prime end of a nucleotide probe is separated from the polymorphic site of a polymorphic allele by a small number of nucleotides. This separation reduces preferential amplification of one allele, termed allele bias. This is an improvement over methods that involve using probes where the 3-prime end or 5-prime end of a correctly hybridized probe is directly adjacent to or very near to the polymorphic site of an allele. In an embodiment, probes in which the hybridizing region may contain or certainly contains a polymorphic site are excluded. Polymorphic sites at the site of hybridization can cause unequal hybridization or inhibit hybridization altogether in some alleles, resulting in preferential amplification of certain alleles. These embodiments are improvements over other methods that involve targeted amplification and/or selective enrichment in that they better preserve the original allele frequencies of the sample at each polymorphic locus, whether the sample is a pure genomic sample from a single individual or a mixture of individuals.

In some embodiments, PCR (referred to as mini-PCR) is used to generate very short amplicons (U.S. application Ser. No. 13/683,604, filed Nov. 21, 2012, U.S. Publication No. 2013/0123120, U.S. application Ser. No. 13/300,235, filed Nov. 18, 2011, U.S. Publication No 2012/0270212, filed Nov. 18, 2011, and U.S. Ser. No. 61/994,791, filed May 16, 2014, which are each hereby incorporated by reference in its entirety). cfDNA (such as fetal cfDNA in maternal serum or necroptically- or apoptotically-released cancer cfDNA) is highly fragmented. For fetal cfDNA, the fragment sizes are distributed in approximately a Gaussian fashion with a mean of 160 bp, a standard deviation of 15 bp, a minimum size of about 100 bp, and a maximum size of about 220 bp. The polymorphic site of one particular target locus may occupy any position from the start to the end among the various fragments originating from that locus. Because cfDNA fragments are short, the likelihood that a fragment of length L comprises both the forward and reverse primer sites depends on the ratio of the length of the amplicon to the length of the fragment. Under ideal conditions, assays in which the amplicon is 45, 50, 55, 60, 65, or 70 bp will successfully amplify from 72%, 69%, 66%, 63%, 59%, or 56%, respectively, of available template fragment molecules. In certain embodiments that relate most preferably to cfDNA from samples of individuals suspected of having cancer, the cfDNA is amplified using primers that yield a maximum amplicon length of 85, 80, 75 or 70 bp, and in certain preferred embodiments 75 bp, and that have a melting temperature between 50 and 65° C., and in certain preferred embodiments, between 54-60.5° C. The amplicon length is the distance between the 5-prime ends of the forward and reverse priming sites.
An amplicon length that is shorter than that typically used by those skilled in the art may result in more efficient measurements of the desired polymorphic loci by only requiring short sequence reads. In an embodiment, a substantial fraction of the amplicons are less than 100 bp, less than 90 bp, less than 80 bp, less than 70 bp, less than 65 bp, less than 60 bp, less than 55 bp, less than 50 bp, or less than 45 bp.
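
The percentages quoted above for amplicons of 45 to 70 bp can be checked with a short calculation. The sketch below assumes a simple model that is not stated explicitly in the text: every fragment has the mean length of 160 bp, and the amplicon is equally likely to begin at any position relative to the fragment, so that about (L − A)/L of fragments of length L contain the whole amplicon of length A. The function names are hypothetical.

```python
import math


def fraction_amplifiable(amplicon_bp: int, fragment_bp: int = 160) -> float:
    """Approximate fraction of fragments that contain both primer sites.

    Assumed model: fixed fragment length and uniformly placed amplicon,
    giving (L - A) / L. The default of 160 bp is the mean fetal cfDNA
    fragment length given in the text.
    """
    if amplicon_bp >= fragment_bp:
        return 0.0  # the amplicon cannot fit inside the fragment
    return (fragment_bp - amplicon_bp) / fragment_bp


def percent_half_up(fraction: float) -> int:
    """Convert a fraction to a whole percent, rounding halves upward."""
    return math.floor(fraction * 100 + 0.5)


for amplicon in (45, 50, 55, 60, 65, 70):
    print(amplicon, "bp:", percent_half_up(fraction_amplifiable(amplicon)), "%")
```

Under these assumptions the calculation yields 72, 69, 66, 63, 59, and 56%, matching the figures in the text. A fuller treatment would average over the Gaussian fragment-length distribution rather than using the mean alone.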

In some embodiments, amplification is performed using direct multiplexed PCR, sequential PCR, nested PCR, doubly nested PCR, one-and-a-half sided nested PCR, fully nested PCR, one sided fully nested PCR, one-sided nested PCR, hemi-nested PCR, triply hemi-nested PCR, semi-nested PCR, one sided semi-nested PCR, reverse semi-nested PCR method, or one-sided PCR, which are described in U.S. application Ser. No. 13/683,604, filed Nov. 21, 2012, U.S. Publication No. 2013/0123120, U.S. application Ser. No. 13/300,235, filed Nov. 18, 2011, U.S. Publication No 2012/0270212, and U.S. Ser. No. 61/994,791, filed May 16, 2014, which are hereby incorporated by reference in their entirety. If desired, any of these methods can be used for mini-PCR.

If desired, the extension step of the PCR amplification may be limited from a time standpoint to reduce amplification from fragments longer than 200 nucleotides, 300 nucleotides, 400 nucleotides, 500 nucleotides or 1,000 nucleotides. This may result in the enrichment of fragmented or shorter DNA (such as fetal DNA or DNA from cancer cells that have undergone apoptosis or necrosis) and improvement of test performance.

In some embodiments, multiplex PCR is used. In some embodiments, the method of amplifying target loci in a nucleic acid sample involves (i) contacting the nucleic acid sample with a library of primers that simultaneously hybridize to at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different target loci to produce a reaction mixture; and (ii) subjecting the reaction mixture to primer extension reaction conditions (such as PCR conditions) to produce amplified products that include target amplicons. In some embodiments, at least 50, 60, 70, 80, 90, 95, 96, 97, 98, 99, or 99.5% of the targeted loci are amplified. In various embodiments, less than 60, 50, 40, 30, 20, 10, 5, 4, 3, 2, 1, 0.5, 0.25, 0.1, or 0.05% of the amplified products are primer dimers. In some embodiments, the primers are in solution (such as being dissolved in the liquid phase rather than in a solid phase). In some embodiments, the primers are in solution and are not immobilized on a solid support. In some embodiments, the primers are not part of a microarray. In some embodiments, the primers do not include molecular inversion probes (MIPs).

In some embodiments, two or more (such as 3 or 4) target amplicons (such as amplicons from the miniPCR method disclosed herein) are ligated together and then the ligated products are sequenced. Combining multiple amplicons into a single ligation product increases the efficiency of the subsequent sequencing step. In some embodiments, the target amplicons are less than 150, 100, 90, 75, or 50 base pairs in length before they are ligated. The selective enrichment and/or amplification may involve tagging each individual molecule with different tags, molecular barcodes, tags for amplification, and/or tags for sequencing. In some embodiments, the amplified products are analyzed by sequencing (such as by high throughput sequencing) or by hybridization to an array, such as a SNP array, the ILLUMINA INFINIUM array, or the AFFYMETRIX gene chip. In some embodiments, nanopore sequencing is used, such as the nanopore sequencing technology developed by Genia (see, for example, the world wide web at geniachip.com/technology, which is hereby incorporated by reference in its entirety). In some embodiments, duplex sequencing is used (Schmitt et al., “Detection of ultra-rare mutations by next-generation sequencing,” Proc Natl Acad Sci USA. 109(36):14508-14513, 2012, which is hereby incorporated by reference in its entirety). This approach greatly reduces errors by independently tagging and sequencing each of the two strands of a DNA duplex. As the two strands are complementary, true mutations are found at the same position in both strands. In contrast, PCR or sequencing errors result in mutations in only one strand and can thus be discounted as technical error. In some embodiments, the method entails tagging both strands of duplex DNA with a random, yet complementary double-stranded nucleotide sequence, referred to as a Duplex Tag.
Double-stranded tag sequences are incorporated into standard sequencing adapters by first introducing a single-stranded randomized nucleotide sequence into one adapter strand and then extending the opposite strand with a DNA polymerase to yield a complementary, double-stranded tag. Following ligation of tagged adapters to sheared DNA, the individually labeled strands are PCR amplified from asymmetric primer sites on the adapter tails and subjected to paired-end sequencing. In some embodiments, a sample (such as a DNA or RNA sample) is divided into multiple fractions, such as different wells (e.g., wells of a WaferGen SmartChip). Dividing the sample into different fractions (such as at least 5, 10, 20, 50, 75, 100, 150, 200, or 300 fractions) can increase the sensitivity of the analysis since the percent of molecules with a mutation is higher in some of the wells than in the overall sample. In some embodiments, each fraction has less than 500, 400, 200, 100, 50, 20, 10, 5, 2, or 1 DNA or RNA molecules. In some embodiments, the molecules in each fraction are sequenced separately. In some embodiments, the same barcode (such as a random or non-human sequence) is added to all the molecules in the same fraction (such as by amplification with a primer containing the barcode or by ligation of a barcode), and different barcodes are added to molecules in different fractions. The barcoded molecules can be pooled and sequenced together. In some embodiments, the molecules are amplified before they are pooled and sequenced, such as by using nested PCR. In some embodiments, one forward and two reverse primers, or two forward and one reverse primers are used.
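
The duplex sequencing principle described above, in which a true mutation appears at the same position on both complementary strands while a technical error appears on only one, can be sketched as follows. This is an illustrative simplification and not the published duplex sequencing pipeline: it assumes that one consensus sequence per strand has already been built from the tagged reads, that all sequences have equal length, and that only single-base substitutions are considered. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def duplex_mutations(reference: str, top_strand: str, bottom_strand: str):
    """List substitutions supported by both strands of one duplex.

    top_strand is read in the same orientation as the reference;
    bottom_strand is read 5'->3' on the opposite strand, so it is
    reverse-complemented before comparison.
    Returns (position, reference_base, mutant_base) tuples, 0-based.
    """
    bottom_as_top = reverse_complement(bottom_strand)
    if not (len(reference) == len(top_strand) == len(bottom_as_top)):
        raise ValueError("sequences must have equal length")
    calls = []
    for pos, (ref, top, bot) in enumerate(zip(reference, top_strand, bottom_as_top)):
        # Keep a change only if both strands agree on the non-reference base.
        if top != ref and top == bot:
            calls.append((pos, ref, top))
    return calls


reference = "ACGTAC"
top = "ACTTAA"     # G>T at position 2 (true) and C>A at position 5 (error)
bottom = "GTAAGT"  # reverse complement of ACTTAC: supports only G>T
print(duplex_mutations(reference, top, bottom))  # [(2, 'G', 'T')]
```

The change at position 5 is seen on the top strand only and is therefore discarded as a technical error.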

In some embodiments, a mutation (such as an SNV or CNV) that is present in less than 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, or 0.005% of the DNA or RNA molecules in a sample (such as a sample of cfDNA or cfRNA) is detected (or is capable of being detected). In some embodiments, a mutation (such as an SNV or CNV) that is present in less than 1,000, 500, 100, 50, 20, 10, 5, 4, 3, or 2 original DNA or RNA molecules (before amplification) in a sample (such as a sample of cfDNA or cfRNA from, e.g., a blood sample) is detected (or is capable of being detected). In some embodiments, a mutation (such as an SNV or CNV) that is present in only 1 original DNA or RNA molecule (before amplification) in a sample (such as a sample of cfDNA or cfRNA from, e.g., a blood sample) is detected (or is capable of being detected).

For example, if the limit of detection of a mutation (such as a single nucleotide variant (SNV)) is 0.1%, a mutation present at 0.01% can be detected by dividing the sample into multiple fractions, such as 100 wells. Most of the wells have no copies of the mutation. For the few wells with the mutation, the mutation is at a much higher percentage of the reads. In one example, there are 20,000 initial copies of DNA from the target locus, and two of those copies include a SNV of interest. If the sample is divided into 100 wells, 98 wells have no copies of the SNV, and 2 wells have the SNV at 0.5%. The DNA in each well can be barcoded, amplified, pooled with DNA from the other wells, and sequenced. Wells without the SNV can be used to measure the background amplification/sequencing error rate to determine if the signal from the outlier wells is above the background level of noise.
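
The arithmetic of the well-partitioning example above can be sketched as follows. The sketch assumes, as the example does, that the copies are split evenly among the wells and that each mutant copy lands in a different well; the function name and the returned dictionary keys are hypothetical.

```python
def partition_summary(total_copies: int, mutant_copies: int,
                      n_wells: int, limit_of_detection: float) -> dict:
    """Summarize the effect of dividing a sample into equal wells.

    Assumptions: copies divide evenly among wells, and each mutant copy
    ends up alone in its own well (requires mutant_copies <= n_wells).
    """
    if total_copies % n_wells != 0:
        raise ValueError("copies must divide evenly among wells")
    if mutant_copies > n_wells:
        raise ValueError("more mutant copies than wells")
    copies_per_well = total_copies // n_wells
    overall_fraction = mutant_copies / total_copies
    positive_well_fraction = 1 / copies_per_well
    return {
        "copies_per_well": copies_per_well,
        "positive_wells": mutant_copies,
        "negative_wells": n_wells - mutant_copies,
        "overall_fraction": overall_fraction,
        "positive_well_fraction": positive_well_fraction,
        "detectable_in_bulk": overall_fraction >= limit_of_detection,
        "detectable_in_well": positive_well_fraction >= limit_of_detection,
    }


# Example from the text: 20,000 copies, 2 mutant copies, 100 wells, LOD 0.1%.
summary = partition_summary(20_000, 2, 100, limit_of_detection=0.001)
print(summary)
```

With these inputs the overall mutant fraction is 0.01%, below the 0.1% limit of detection, while each of the 2 positive wells holds 200 copies with the SNV at 0.5%, above the limit; the remaining 98 wells can serve as the background estimate.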

In some embodiments, the amplified products are detected using an array, such as a microarray with probes to one or more chromosomes of interest (e.g., chromosome 13, 18, 21, X, Y, or any combination thereof). It will be understood, for example, that a commercially available SNP detection microarray could be used such as, for example, the Illumina (San Diego, Calif.) GoldenGate, DASL, Infinium, or CytoSNP-12 genotyping assay, or a SNP detection microarray product from Affymetrix, such as the OncoScan microarray. In some embodiments, phased genetic data for one or both biological parents of the embryo or fetus is used to increase the accuracy of analysis of array data from a single cell.

In some embodiments involving sequencing, the depth of read is the number of sequencing reads that map to a given locus. The depth of read may be normalized over the total number of reads. In some embodiments for depth of read of a sample, the depth of read is the average depth of read over the targeted loci. In some embodiments for the depth of read of a locus, the depth of read is the number of reads measured by the sequencer mapping to that locus. In general, the greater the depth of read of a locus, the closer the ratio of alleles at the locus tends to be to the ratio of alleles in the original sample of DNA. Depth of read can be expressed in a variety of different ways, including but not limited to the percentage or proportion. Thus, for example, in a highly parallel DNA sequencer such as an Illumina HISEQ, which, e.g., produces a sequence of 1 million clones, the sequencing of one locus 3,000 times results in a depth of read of 3,000 reads at that locus. The proportion of reads at that locus is 3,000 divided by 1 million total reads, or 0.3% of the total reads.
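
The depth-of-read quantities described above reduce to simple ratios. The following sketch uses hypothetical function names and, apart from the 3,000 reads out of 1 million taken from the text, hypothetical read counts.

```python
def read_proportion(reads_at_locus: int, total_reads: int) -> float:
    """Depth of read at a locus normalized over the total number of reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return reads_at_locus / total_reads


def sample_depth_of_read(depth_by_locus: dict) -> float:
    """Depth of read of a sample: average depth over the targeted loci."""
    if not depth_by_locus:
        raise ValueError("at least one targeted locus is required")
    return sum(depth_by_locus.values()) / len(depth_by_locus)


# Example from the text: one locus sequenced 3,000 times out of 1 million reads.
print(f"{read_proportion(3_000, 1_000_000):.1%}")  # 0.3%

# Hypothetical per-locus depths for three targeted loci.
print(sample_depth_of_read({"locus_1": 3_000, "locus_2": 2_500, "locus_3": 3_500}))
```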

In some embodiments, allelic data is obtained, wherein the allelic data includes quantitative measurement(s) indicative of the number of copies of a specific allele of a polymorphic locus. In some embodiments, the allelic data includes quantitative measurement(s) indicative of the number of copies of each of the alleles observed at a polymorphic locus. Typically, quantitative measurements are obtained for all possible alleles of the polymorphic locus of interest. For example, any of the methods discussed in the preceding paragraphs for determining the allele for a SNP or SNV locus, such as, for example, microarrays, qPCR, or DNA sequencing, such as high throughput DNA sequencing, can be used to generate quantitative measurements of the number of copies of a specific allele of a polymorphic locus. This quantitative measurement is referred to herein as allelic frequency data or measured genetic allelic data. Methods using allelic data are sometimes referred to as quantitative allelic methods; this is in contrast to quantitative methods which exclusively use quantitative data from non-polymorphic loci, or from polymorphic loci but without regard to allelic identity. When the allelic data is measured using high-throughput sequencing, the allelic data typically includes the number of reads of each allele mapping to the locus of interest.

In some embodiments, non-allelic data is obtained, wherein the non-allelic data includes quantitative measurement(s) indicative of the number of copies of a specific locus. The locus may be polymorphic or non-polymorphic. In some embodiments when the locus is non-polymorphic, the non-allelic data does not contain information about the relative or absolute quantity of the individual alleles that may be present at that locus. Methods using non-allelic data only (that is, quantitative data from non-polymorphic loci, or quantitative data from polymorphic loci but without regard to the allelic identity of each fragment) are referred to as quantitative methods. Typically, quantitative measurements are obtained for all possible alleles of the polymorphic locus of interest, with one value associated with the measured quantity for all of the alleles at that locus, in total. Non-allelic data for a polymorphic locus may be obtained by summing the quantitative allelic data for each allele at that locus. When the data is measured using high-throughput sequencing, the non-allelic data typically includes the number of reads mapping to the locus of interest. The sequencing measurements could indicate the relative and/or absolute number of each of the alleles present at the locus, and the non-allelic data includes the sum of the reads, regardless of the allelic identity, mapping to the locus. In some embodiments the same set of sequencing measurements can be used to yield both allelic data and non-allelic data. In some embodiments, the allelic data is used as part of a method to determine copy number at a chromosome of interest, and the produced non-allelic data can be used as part of a different method to determine copy number at a chromosome of interest. In some embodiments, the two methods are statistically orthogonal, and are combined to give a more accurate determination of the copy number at the chromosome of interest.
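
As an illustration of how one set of sequencing measurements can yield both allelic and non-allelic data, the sketch below derives per-locus totals and allele fractions from per-allele read counts. The data layout, the function names, and the read counts are hypothetical and chosen only for illustration.

```python
# Hypothetical allelic data: reads of each allele mapping to each locus.
allelic_counts = {
    "locus_1": {"A": 620, "G": 580},
    "locus_2": {"C": 1000, "T": 0},
}


def non_allelic_counts(allelic: dict) -> dict:
    """Sum the reads over all alleles at each locus, ignoring allele identity."""
    return {locus: sum(counts.values()) for locus, counts in allelic.items()}


def allele_fractions(allelic: dict) -> dict:
    """Fraction of reads carrying each allele at each locus with any reads."""
    fractions = {}
    for locus, counts in allelic.items():
        total = sum(counts.values())
        if total > 0:
            fractions[locus] = {a: n / total for a, n in counts.items()}
    return fractions


print(non_allelic_counts(allelic_counts))  # {'locus_1': 1200, 'locus_2': 1000}
print(allele_fractions(allelic_counts)["locus_1"])
```

The per-locus totals could feed a quantitative (non-allelic) method, while the per-allele fractions could feed a quantitative allelic method, consistent with using the two kinds of data in separate analyses.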

In some embodiments obtaining genetic data includes (i) acquiring DNA sequence information by laboratory techniques, e.g., by the use of an automated high throughput DNA sequencer, or (ii) acquiring information that had been previously obtained by laboratory techniques, wherein the information is electronically transmitted, e.g., by a computer over the internet or by electronic transfer from the sequencing device.

Additional exemplary sample preparation, amplification, and quantification methods are described in U.S. application Ser. No. 13/683,604, filed Nov. 21, 2012 (U.S. Publication No. 2013/0123120) and U.S. Ser. No. 61/994,791, filed May 16, 2014, which are each hereby incorporated by reference in its entirety. These methods can be used for analysis of any of the samples disclosed herein.

Exemplary Quantification Methods for Cell Free DNA

If desired, the amount or concentration of cfDNA or cfRNA can be measured using standard methods. In some embodiments, the amount or concentration of cell-free mitochondrial DNA (cf mDNA) is determined. In some embodiments, the amount or concentration of cell-free DNA that originated from nuclear DNA (cf nDNA) is determined. In some embodiments, the amount or concentration of cf mDNA and cf nDNA are determined simultaneously.

In some embodiments, qPCR is used to measure cf nDNA and/or cf mDNA (Kohler et al. “Levels of plasma circulating cell free nuclear and mitochondrial DNA as potential biomarkers for breast tumors.” Mol Cancer 8:105, 2009, doi:10.1186/1476-4598-8-105, which is hereby incorporated by reference in its entirety). For example, one or more loci from cf nDNA (such as glyceraldehyde-3-phosphate dehydrogenase, GAPDH) and one or more loci from cf mDNA (ATPase 8, MTATP 8) can be measured using multiplex qPCR. In some embodiments, fluorescence-labelled PCR is used to measure cf nDNA and/or cf mDNA (Schwarzenbach et al., “Evaluation of cell-free tumour DNA and RNA in patients with breast cancer and benign breast disease.” Mol Biosys 7:2848-2854, 2011, which is hereby incorporated by reference in its entirety). If desired, the normality of the data distribution can be determined using standard methods, such as the Shapiro-Wilk-Test. If desired, cf nDNA and mDNA levels can be compared using standard methods, such as the Mann-Whitney-U-Test. In some embodiments, cf nDNA and/or mDNA levels are compared with other established prognostic factors using standard methods, such as the Mann-Whitney-U-Test or the Kruskal-Wallis-Test.

Exemplary RNA Amplification, Quantification, and Analysis Methods

Any of the following exemplary methods may be used to amplify and optionally quantify RNA, such as cfRNA, cellular RNA, cytoplasmic RNA, coding cytoplasmic RNA, non-coding cytoplasmic RNA, mRNA, miRNA, mitochondrial RNA, rRNA, or tRNA. In some embodiments, the miRNA is any of the miRNA molecules listed in the miRBase database available at the world wide web at mirbase.org, which is hereby incorporated by reference in its entirety. Exemplary miRNA molecules include miR-509, miR-21, and miR-146a.

In some embodiments, reverse-transcriptase multiplex ligation-dependent probe amplification (RT-MLPA) is used to amplify RNA. In some embodiments, each set of hybridizing probes consists of two short synthetic oligonucleotides spanning the SNP and one long oligonucleotide (Li et al., Arch Gynecol Obstet. “Development of noninvasive prenatal diagnosis of trisomy 21 by RT-MLPA with a new set of SNP markers,” Jul. 5, 2013, DOI 10.1007/s00404-013-2926-5; Schouten et al. “Relative quantification of 40 nucleic acid sequences by multiplex ligation-dependent probe amplification.” Nucleic Acids Res 30:e57, 2002; Deng et al. (2011) “Non-invasive prenatal diagnosis of trisomy 21 by reverse transcriptase multiplex ligation-dependent probe amplification,” Clin. Chem. Lab Med. 49:641-646, 2011, which are each hereby incorporated by reference in its entirety).

In some embodiments, RNA is amplified with reverse-transcriptase PCR. In some embodiments, RNA is amplified with real-time reverse-transcriptase PCR, such as one-step real-time reverse-transcriptase PCR with SYBR GREEN I as previously described (Li et al., Arch Gynecol Obstet. “Development of noninvasive prenatal diagnosis of trisomy 21 by RT-MLPA with a new set of SNP markers,” Jul. 5, 2013, DOI 10.1007/s00404-013-2926-5; Lo et al., “Plasma placental RNA allelic ratio permits noninvasive prenatal chromosomal aneuploidy detection,” Nat Med 13:218-223, 2007; Tsui et al., “Systematic micro-array based identification of placental mRNA in maternal plasma: towards non-invasive prenatal gene expression profiling,” J Med Genet 41:461-467, 2004; Gu et al., J. Neurochem. 122:641-649, 2012, which are each hereby incorporated by reference in its entirety).

In some embodiments, a microarray is used to detect RNA. For example, a human miRNA microarray from Agilent Technologies can be used according to the manufacturer's protocol. Briefly, isolated RNA is dephosphorylated and ligated with pCp-Cy3. Labeled RNA is purified and hybridized to miRNA arrays containing probes for human mature miRNAs on the basis of Sanger miRBase release 14.0. The arrays are washed and scanned with use of a microarray scanner (G2565BA, Agilent Technologies). The intensity of each hybridization signal is evaluated by Agilent extraction software v9.5.3. The labeling, hybridization, and scanning may be performed according to the protocols in the Agilent miRNA microarray system (Gu et al., J. Neurochem. 122:641-649, 2012, which is hereby incorporated by reference in its entirety).

In some embodiments, a TaqMan assay is used to detect RNA. An exemplary assay is the TaqMan Array Human MicroRNA Panel v1.0 (Early Access) (Applied Biosystems), which contains 157 TaqMan MicroRNA Assays, including the respective reverse-transcription primers, PCR primers, and TaqMan probe (Chim et al., “Detection and characterization of placental microRNAs in maternal plasma,” Clin Chem. 54(3):482-90, 2008, which is hereby incorporated by reference in its entirety).

If desired, the mRNA splicing pattern of one or more mRNAs can be determined using standard methods (Fackenthal and Godley, Disease Models & Mechanisms 1: 37-42, 2008, doi:10.1242/dmm.000331, which is hereby incorporated by reference in its entirety). For example, high-density microarrays and/or high-throughput DNA sequencing can be used to detect mRNA splice variants.

In some embodiments, whole transcriptome shotgun sequencing or an array is used to measure the transcriptome.

Exemplary Amplification Methods

Improved PCR amplification methods have also been developed that minimize or prevent interference due to the amplification of nearby or adjacent target loci in the same reaction volume (such as part of the same multiplex PCR reaction that simultaneously amplifies all the target loci). These methods can be used to simultaneously amplify nearby or adjacent target loci, which is faster and cheaper than having to separate nearby target loci into different reaction volumes so that they can be amplified separately to avoid interference.

In some embodiments, the amplification of target loci is performed using a polymerase (e.g., a DNA polymerase, RNA polymerase, or reverse transcriptase) with low 5′→3′ exonuclease and/or low strand displacement activity. In some embodiments, the low level of 5′→3′ exonuclease reduces or prevents the degradation of a nearby primer (e.g., an unextended primer or a primer that has had one or more nucleotides added to it during primer extension). In some embodiments, the low level of strand displacement activity reduces or prevents the displacement of a nearby primer (e.g., an unextended primer or a primer that has had one or more nucleotides added to it during primer extension). In some embodiments, target loci that are adjacent to each other (e.g., no bases between the target loci) or nearby (e.g., loci are within 50, 40, 30, 20, 15, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 base) are amplified. In some embodiments, the 3′ end of one locus is within 50, 40, 30, 20, 15, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 base of the 5′ end of the next downstream locus.

In some embodiments, at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different target loci are amplified, such as by the simultaneous amplification in one reaction volume. In some embodiments, at least 50, 60, 70, 80, 90, 95, 96, 97, 98, 99, or 99.5% of the amplified products are target amplicons. In various embodiments, the amount of amplified products that are target amplicons is between 50 to 99.5%, such as between 60 to 99%, 70 to 98%, 80 to 98%, 90 to 99.5%, or 95 to 99.5%, inclusive. In some embodiments, at least 50, 60, 70, 80, 90, 95, 96, 97, 98, 99, or 99.5% of the targeted loci are amplified (e.g., amplified at least 5, 10, 20, 30, 50, or 100-fold compared to the amount prior to amplification), such as by the simultaneous amplification in one reaction volume. In various embodiments, the amount of target loci that are amplified (e.g., amplified at least 5, 10, 20, 30, 50, or 100-fold compared to the amount prior to amplification) is between 50 to 99.5%, such as between 60 to 99%, 70 to 98%, 80 to 99%, 90 to 99.5%, 95 to 99.9%, or 98 to 99.99%, inclusive. In some embodiments, fewer non-target amplicons are produced, such as fewer amplicons formed from a forward primer from a first primer pair and a reverse primer from a second primer pair. Such undesired non-target amplicons can be produced using prior amplification methods if, e.g., the reverse primer from the first primer pair and/or the forward primer from the second primer pair are degraded and/or displaced.

In some embodiments, these methods allow longer extension times to be used since the polymerase bound to a primer being extended is less likely to degrade and/or displace a nearby primer (such as the next downstream primer) given the low 5′→3′ exonuclease and/or low strand displacement activity of the polymerase. In various embodiments, reaction conditions (such as the extension time and temperature) are used such that the extension rate of the polymerase allows the number of nucleotides that are added to a primer being extended to be equal to or greater than 80, 90, 95, 100, 110, 120, 130, 140, 150, 175, or 200% of the number of nucleotides between the 3′ end of the primer binding site and the 5′ end of the next downstream primer binding site on the same strand.

In some embodiments, a DNA polymerase is used to produce DNA amplicons using DNA as a template. In some embodiments, an RNA polymerase is used to produce RNA amplicons using DNA as a template. In some embodiments, a reverse transcriptase is used to produce cDNA amplicons using RNA as a template.

In some embodiments, the low level of 5′→3′ exonuclease of the polymerase is less than 80, 70, 60, 50, 40, 30, 20, 10, 5, 1, or 0.1% of the activity of the same amount of Thermus aquaticus polymerase (“Taq” polymerase, which is a commonly used DNA polymerase from a thermophilic bacterium, PDB 1BGX, EC 2.7.7.7, Murali et al., “Crystal structure of Taq DNA polymerase in complex with an inhibitory Fab: the Fab is directed against an intermediate in the helix-coil dynamics of the enzyme,” Proc. Natl. Acad. Sci. USA 95:12562-12567, 1998, which is hereby incorporated by reference in its entirety) under the same conditions. In some embodiments, the low level of strand displacement activity of the polymerase is less than 80, 70, 60, 50, 40, 30, 20, 10, 5, 1, or 0.1% of the activity of the same amount of Taq polymerase under the same conditions.

In some embodiments, the polymerase is a PHUSION DNA polymerase, such as PHUSION High Fidelity DNA polymerase (M0530S, New England BioLabs, Inc.) or PHUSION Hot Start Flex DNA polymerase (M0535S, New England BioLabs, Inc.; Frey and Suppman BioChemica. 2:34-35, 1995; Chester and Marshak Analytical Biochemistry. 209:284-290, 1993, which are each hereby incorporated by reference in its entirety). The PHUSION DNA polymerase is a Pyrococcus-like enzyme fused with a processivity-enhancing domain. PHUSION DNA polymerase possesses 5′→3′ polymerase activity and 3′→5′ exonuclease activity, and generates blunt-ended products. PHUSION DNA polymerase lacks 5′→3′ exonuclease activity and strand displacement activity.

In some embodiments, the polymerase is a Q5® DNA Polymerase, such as Q5® High-Fidelity DNA Polymerase (M0491S, New England BioLabs, Inc.) or Q5® Hot Start High-Fidelity DNA Polymerase (M0493S, New England BioLabs, Inc.). Q5® High-Fidelity DNA polymerase is a high-fidelity, thermostable, DNA polymerase with 3′→5′ exonuclease activity, fused to a processivity-enhancing Sso7d domain. Q5® High-Fidelity DNA polymerase lacks 5′→3′ exonuclease activity and strand displacement activity.

In some embodiments, the polymerase is a T4 DNA polymerase (M0203S, New England BioLabs, Inc.; Tabor and Struhl. (1989). “DNA-Dependent DNA Polymerases,” In Ausubel et al. (Ed.), Current Protocols in Molecular Biology. 3.5.10-3.5.12. New York: John Wiley & Sons, Inc., 1989; Sambrook et al. Molecular Cloning: A Laboratory Manual. (2nd ed.), 5.44-5.47. Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 1989, which are each hereby incorporated by reference in its entirety). T4 DNA Polymerase catalyzes the synthesis of DNA in the 5′→3′ direction and requires the presence of template and primer. This enzyme has a 3′→5′ exonuclease activity which is much more active than that found in DNA Polymerase I. T4 DNA polymerase lacks 5′→3′ exonuclease activity and strand displacement activity.

In some embodiments, the polymerase is a Sulfolobus DNA Polymerase IV (M0327S, New England BioLabs, Inc.; Boudsocq, et al. (2001). Nucleic Acids Res., 29:4607-4616, 2001; McDonald, et al. (2006). Nucleic Acids Res., 34:1102-1111, 2006, which are each hereby incorporated by reference in its entirety). Sulfolobus DNA Polymerase IV is a thermostable Y-family lesion-bypass DNA Polymerase that efficiently synthesizes DNA across a variety of DNA template lesions (McDonald, J. P. et al. (2006). Nucleic Acids Res., 34, 1102-1111, which is hereby incorporated by reference in its entirety). Sulfolobus DNA Polymerase IV lacks 5′→3′ exonuclease activity and strand displacement activity.

In some embodiments, if a primer binds a region with a SNP, the primer may bind and amplify the different alleles with different efficiencies or may only bind and amplify one allele. For subjects who are heterozygous, one of the alleles may not be amplified by the primer. In some embodiments, a primer is designed for each allele. For example, if there are two alleles (e.g., a biallelic SNP), then two primers can be used to bind the same location of a target locus (e.g., a forward primer to bind the “A” allele and a forward primer to bind the “B” allele). Standard methods, such as the dbSNP database, can be used to determine the location of known SNPs, such as SNP hot spots that have a high heterozygosity rate.

In some embodiments, the amplicons are similar in size. In some embodiments, the range of the length of the target amplicons is less than 100, 75, 50, 25, 15, 10, or 5 nucleotides. In some embodiments (such as the amplification of target loci in fragmented DNA or RNA), the length of the target amplicons is between 50 and 100 nucleotides, such as between 60 and 80 nucleotides, or 60 and 75 nucleotides, inclusive. In some embodiments (such as the amplification of multiple target loci throughout an exon or gene), the length of the target amplicons is between 100 and 500 nucleotides, such as between 150 and 450 nucleotides, 200 and 400 nucleotides, 200 and 300 nucleotides, or 300 and 400 nucleotides, inclusive.

In some embodiments, multiple target loci are simultaneously amplified using a primer pair that includes a forward and reverse primer for each target locus to be amplified in that reaction volume. In some embodiments, one round of PCR is performed with a single primer per target locus, and then a second round of PCR is performed with a primer pair per target locus. For example, the first round of PCR may be performed with a single primer per target locus such that all the primers bind the same strand (such as using a forward primer for each target locus). This allows the PCR to amplify in a linear manner and reduces or eliminates amplification bias between amplicons due to sequence or length differences. In some embodiments, the amplicons are then amplified using a forward and reverse primer for each target locus.

Exemplary Primer Design Methods

If desired, multiplex PCR may be performed using primers with a decreased likelihood of forming primer dimers. In particular, highly multiplexed PCR can often result in the production of a very high proportion of product DNA that results from unproductive side reactions such as primer dimer formation. In an embodiment, the particular primers that are most likely to cause unproductive side reactions may be removed from the primer library to give a primer library that will result in a greater proportion of amplified DNA that maps to the genome. The step of removing problematic primers, that is, those primers that are particularly likely to form dimers, has unexpectedly enabled extremely high PCR multiplexing levels for subsequent analysis by sequencing.

There are a number of ways to choose primers for a library where the amount of non-mapping primer dimer or other primer mischief products is minimized. Empirical data indicate that a small number of ‘bad’ primers are responsible for a large amount of non-mapping primer dimer side reactions. Removing these ‘bad’ primers can increase the percent of sequence reads that map to targeted loci. One way to identify the ‘bad’ primers is to look at the sequencing data of DNA that was amplified by targeted amplification; the primers found in those primer dimers that are seen with greatest frequency can be removed to give a primer library that is significantly less likely to result in side product DNA that does not map to the genome. There are also publicly available programs that can calculate the binding energy of various primer combinations, and removing those with the highest binding energy will also give a primer library that is significantly less likely to result in side product DNA that does not map to the genome.
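
By way of a non-limiting sketch, identifying ‘bad’ primers from sequencing data may be approximated by tallying how often each primer appears in reads classified as primer dimers. The input format (one pair of primer identifiers per dimer read) and the function name are assumptions for illustration; the classification of reads as dimers is assumed to have been performed upstream.

```python
from collections import Counter

def worst_primers_by_dimer_reads(dimer_reads, n_remove):
    """Return the n_remove primers seen most often in primer-dimer reads.

    dimer_reads: iterable of (primer_a, primer_b) tuples, one per read that
    was classified as a primer dimer (hypothetical upstream step).
    """
    counts = Counter()
    for a, b in dimer_reads:
        counts[a] += 1
        if b != a:  # a self-dimer is counted once for that primer
            counts[b] += 1
    return [primer for primer, _ in counts.most_common(n_remove)]

# Hypothetical dimer reads observed after targeted amplification
dimer_reads = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P5", "P6")]
bad = worst_primers_by_dimer_reads(dimer_reads, n_remove=1)
```

Here primer P1 participates in three of the four dimer reads and is therefore the first candidate for removal from the library.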

In some embodiments for selecting primers, an initial library of candidate primers is created by designing one or more primers or primer pairs to candidate target loci. A set of candidate target loci (such as SNPs) can be selected based on publicly available information about desired parameters for the target loci, such as frequency of the SNPs within a target population or the heterozygosity rate of the SNPs. In one embodiment, the PCR primers may be designed using the Primer3 program (the worldwide web at primer3.sourceforge.net; libprimer3 release 2.2.3, which is hereby incorporated by reference in its entirety). If desired, the primers can be designed to anneal within a particular annealing temperature range, have a particular range of GC contents, have a particular size range, produce target amplicons in a particular size range, and/or have other parameter characteristics. Starting with multiple primers or primer pairs per candidate target locus increases the likelihood that a primer or primer pair will remain in the library for most or all of the target loci. In one embodiment, the selection criteria may require that at least one primer pair per target locus remains in the library. That way, most or all of the target loci will be amplified when using the final primer library. This is desirable for applications such as screening for deletions or duplications at a large number of locations in the genome or screening for a large number of sequences (such as polymorphisms or other mutations) associated with a disease or an increased risk for a disease. If a primer pair from the library would produce a target amplicon that overlaps with a target amplicon produced by another primer pair, one of the primer pairs may be removed from the library to prevent interference.

In some embodiments, an “undesirability score” (a higher score representing lower desirability) is calculated (such as by calculation on a computer) for most or all of the possible combinations of two primers from a library of candidate primers. In various embodiments, an undesirability score is calculated for at least 80, 90, 95, 98, 99, or 99.5% of the possible combinations of candidate primers in the library. Each undesirability score is based at least in part on the likelihood of dimer formation between the two candidate primers. If desired, the undesirability score may also be based on one or more other parameters selected from the group consisting of heterozygosity rate of the target locus, disease prevalence associated with a sequence (e.g., a polymorphism) at the target locus, disease penetrance associated with a sequence (e.g., a polymorphism) at the target locus, specificity of the candidate primer for the target locus, size of the candidate primer, melting temperature of the target amplicon, GC content of the target amplicon, amplification efficiency of the target amplicon, size of the target amplicon, and distance from the center of a recombination hotspot. In some embodiments, the specificity of the candidate primer for the target locus includes the likelihood that the candidate primer will mis-prime by binding and amplifying a locus other than the target locus it was designed to amplify. In some embodiments, one or more or all the candidate primers that mis-prime are removed from the library. In some embodiments to increase the number of candidate primers to choose from, candidate primers that may mis-prime are not removed from the library. If multiple factors are considered, the undesirability score may be calculated based on a weighted average of the various parameters. The parameters may be assigned different weights based on their importance for the particular application that the primers will be used for. In some embodiments, the primer with the highest undesirability score is removed from the library. If the removed primer is a member of a primer pair that hybridizes to one target locus, then the other member of the primer pair may be removed from the library. The process of removing primers may be repeated as desired. In some embodiments, the selection method is performed until the undesirability scores for the candidate primer combinations remaining in the library are all equal to or below a minimum threshold. In some embodiments, the selection method is performed until the number of candidate primers remaining in the library is reduced to a desired number.
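
The score-based selection described above may be sketched, purely for illustration, as follows. The data structures, the parameter names in the weighted average, and the rule for choosing which primer of the worst-scoring combination to remove (the one whose above-threshold scores sum highest) are assumptions introduced for this sketch and are not mandated by the method.

```python
def weighted_score(params, weights):
    """Weighted average of per-combination parameters (names hypothetical)."""
    total = sum(weights.values())
    return sum(weights[k] * params[k] for k in weights) / total

def prune_by_highest_score(primers, pair_scores, partner, threshold):
    """Remove primers until every remaining pairwise score is <= threshold.

    pair_scores: dict mapping frozenset({a, b}) -> undesirability score
    partner: dict mapping a primer to the other member of its primer pair
    """
    remaining = set(primers)
    while True:
        # Combinations still in the library whose score exceeds the threshold
        live = {pair: s for pair, s in pair_scores.items()
                if pair <= remaining and s > threshold}
        if not live:
            return remaining
        worst_pair = max(live, key=live.get)

        def load(p):
            # Assumed tie-break rule: total above-threshold score for primer p
            return sum(s for pair, s in live.items() if p in pair)

        victim = max(sorted(worst_pair), key=load)
        remaining.discard(victim)
        # Also drop the other member of the victim's primer pair, if any
        remaining.discard(partner.get(victim))

# Hypothetical library of three primer pairs
primers = ["F1", "R1", "F2", "R2", "F3", "R3"]
partner = {"F1": "R1", "R1": "F1", "F2": "R2", "R2": "F2", "F3": "R3", "R3": "F3"}
pair_scores = {
    frozenset({"F1", "F2"}): 0.9,
    frozenset({"F1", "R3"}): 0.8,
    frozenset({"F2", "R3"}): 0.2,
}
kept = prune_by_highest_score(primers, pair_scores, partner, threshold=0.5)
example_score = weighted_score({"dimer": 0.8, "gc": 0.2}, {"dimer": 3, "gc": 1})
```

In this example, F1 is in both above-threshold combinations, so F1 and its partner R1 are removed and the remaining four primers satisfy the threshold.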

In various embodiments, after the undesirability scores are calculated, the candidate primer that is part of the greatest number of combinations of two candidate primers with an undesirability score above a first minimum threshold is removed from the library. This step ignores interactions equal to or below the first minimum threshold since these interactions are less significant. If the removed primer is a member of a primer pair that hybridizes to one target locus, then the other member of the primer pair may be removed from the library. The process of removing primers may be repeated as desired. In some embodiments, the selection method is performed until the undesirability scores for the candidate primer combinations remaining in the library are all equal to or below the first minimum threshold. If the number of candidate primers remaining in the library is higher than desired, the number of primers may be reduced by decreasing the first minimum threshold to a lower second minimum threshold and repeating the process of removing primers. If the number of candidate primers remaining in the library is lower than desired, the method can be continued by increasing the first minimum threshold to a higher second minimum threshold and repeating the process of removing primers using the original candidate primer library, thereby allowing more of the candidate primers to remain in the library. In some embodiments, the selection method is performed until the undesirability scores for the candidate primer combinations remaining in the library are all equal to or below the second minimum threshold, or until the number of candidate primers remaining in the library is reduced to a desired number.
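
A minimal sketch of this count-based variant is given below under stated assumptions: pairwise scores are held in a dictionary keyed by unordered primer pairs, ties are broken alphabetically, and the threshold adjustment is simplified to trying a list of candidate thresholds from most to least lenient, each time starting from the original library. All names are hypothetical.

```python
from collections import Counter

def prune_by_interaction_count(primers, pair_scores, partner, threshold):
    """Repeatedly remove the primer involved in the most above-threshold
    combinations until no such combination remains."""
    remaining = set(primers)
    while True:
        counts = Counter()
        for pair, score in pair_scores.items():
            if score > threshold and pair <= remaining:
                counts.update(pair)  # one count for each primer in the pair
        if not counts:
            return remaining
        top = max(counts.values())
        # Assumed tie-break: alphabetical order, for reproducibility
        victim = min(p for p, c in counts.items() if c == top)
        remaining.discard(victim)
        remaining.discard(partner.get(victim))

def prune_to_target_size(primers, pair_scores, partner, thresholds, max_primers):
    """Return the most lenient threshold (and library) with at most
    max_primers primers; thresholds must be non-empty."""
    for t in sorted(thresholds, reverse=True):
        kept = prune_by_interaction_count(primers, pair_scores, partner, t)
        if len(kept) <= max_primers:
            break
    return t, kept

primers = ["A_F", "A_R", "B_F", "B_R", "C_F", "C_R"]
partner = {"A_F": "A_R", "A_R": "A_F", "B_F": "B_R",
           "B_R": "B_F", "C_F": "C_R", "C_R": "C_F"}
pair_scores = {
    frozenset({"A_F", "B_F"}): 0.9,
    frozenset({"A_F", "C_F"}): 0.7,
    frozenset({"B_F", "C_R"}): 0.3,
}
first_pass = prune_by_interaction_count(primers, pair_scores, partner, 0.5)
chosen_threshold, final = prune_to_target_size(
    primers, pair_scores, partner, thresholds=[0.5, 0.2], max_primers=2)
```

With the first threshold of 0.5 only the A pair is removed; lowering the threshold to 0.2 also removes the B pair, illustrating how a lower second threshold reduces the library further.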

If desired, primer pairs that produce a target amplicon that overlaps with a target amplicon produced by another primer pair can be divided into separate amplification reactions. Multiple PCR amplification reactions may be desirable for applications in which it is desirable to analyze all of the candidate target loci (instead of omitting candidate target loci from the analysis due to overlapping target amplicons).

These selection methods minimize the number of candidate primers that have to be removed from the library to achieve the desired reduction in primer dimers. By removing a smaller number of candidate primers from the library, more (or all) of the target loci can be amplified using the resulting primer library.

Multiplexing large numbers of primers imposes considerable constraint on the assays that can be included. Assays that unintentionally interact result in spurious amplification products. The size constraints of miniPCR may result in further constraints. In an embodiment, it is possible to begin with a very large number of potential SNP targets (between about 500 to greater than 1 million) and attempt to design primers to amplify each SNP. Where primers can be designed, it is possible to attempt to identify primer pairs likely to form spurious products by evaluating the likelihood of spurious primer duplex formation between all possible pairs of primers using published thermodynamic parameters for DNA duplex formation. Primer interactions may be ranked by a scoring function related to the interaction, and primers with the worst interaction scores are eliminated until the number of primers desired is met. In cases where SNPs likely to be heterozygous are most useful, it is possible to also rank the list of assays and select the most heterozygous compatible assays. Experiments have validated that primers with high interaction scores are most likely to form primer dimers. At high multiplexing it is not possible to eliminate all spurious interactions, but it is essential to remove the primers or pairs of primers with the highest interaction scores in silico as they can dominate an entire reaction, greatly limiting amplification from intended targets. We have performed this procedure to create multiplex primer sets of up to and in some cases more than 10,000 primers. The improvement due to this procedure is substantial, enabling amplification of more than 80%, more than 90%, more than 95%, more than 98%, and even more than 99% on-target products as determined by sequencing of all PCR products, as compared to 10% from a reaction in which the worst primers were not removed. When combined with a partial semi-nested approach as previously described, more than 90%, and even more than 95% of amplicons may map to the targeted sequences.

Note that there are other methods for determining which PCR probes are likely to form dimers. In an embodiment, analysis of a pool of DNA that has been amplified using a non-optimized set of primers may be sufficient to determine problematic primers. For example, analysis may be done using sequencing, and the primers in those dimers which are present in the greatest number are determined to be those most likely to form dimers, and may be removed. In an embodiment, the method of primer design may be used in combination with the mini-PCR method described herein.

The use of tags on the primers may reduce amplification and sequencing of primer dimer products. In some embodiments, the primer contains an internal region that forms a loop structure with a tag. In particular embodiments, the primers include a 5′ region that is specific for a target locus, an internal region that is not specific for the target locus and forms a loop structure, and a 3′ region that is specific for the target locus. In some embodiments, the loop region may lie between two binding regions where the two binding regions are designed to bind to contiguous or neighboring regions of template DNA. In various embodiments, the length of the 3′ region is at least 7 nucleotides. In some embodiments, the length of the 3′ region is between 7 and 20 nucleotides, such as between 7 to 15 nucleotides, or 7 to 10 nucleotides, inclusive. In various embodiments, the primers include a 5′ region that is not specific for a target locus (such as a tag or a universal primer binding site) followed by a region that is specific for a target locus, an internal region that is not specific for the target locus and forms a loop structure, and a 3′ region that is specific for the target locus. Tag-primers can be used to shorten necessary target-specific sequences to below 20, below 15, below 12, and even below 10 base pairs. This can be serendipitous with standard primer design when the target sequence is fragmented within the primer binding site, or it can be designed into the primer design. Advantages of this method include: it increases the number of assays that can be designed for a certain maximal amplicon length, and it shortens the “non-informative” sequencing of primer sequence. It may also be used in combination with internal tagging.

In an embodiment, the relative amount of nonproductive products in the multiplexed targeted PCR amplification can be reduced by raising the annealing temperature. In cases where one is amplifying libraries with the same tag as the target specific primers, the annealing temperature can be increased in comparison to the genomic DNA as the tags will contribute to the primer binding. In some embodiments reduced primer concentrations are used, optionally along with longer annealing times. In some embodiments the annealing times may be longer than 3 minutes, longer than 5 minutes, longer than 8 minutes, longer than 10 minutes, longer than 15 minutes, longer than 20 minutes, longer than 30 minutes, longer than 60 minutes, longer than 120 minutes, longer than 240 minutes, longer than 480 minutes, and even longer than 960 minutes. In certain illustrative embodiments, longer annealing times are used along with reduced primer concentrations. In various embodiments, longer than normal extension times are used, such as greater than 3, 5, 8, 10, or 15 minutes. In some embodiments, the primer concentrations are as low as 50 nM, 20 nM, 10 nM, 5 nM, 1 nM, and lower than 1 nM. This surprisingly results in robust performance for highly multiplexed reactions, for example 1,000-plex reactions, 2,000-plex reactions, 5,000-plex reactions, 10,000-plex reactions, 20,000-plex reactions, 50,000-plex reactions, and even 100,000-plex reactions. In an embodiment, the amplification uses one, two, three, four or five cycles run with long annealing times, followed by PCR cycles with more usual annealing times with tagged primers.

To select target locations, one may start with a pool of candidate primer pair designs and create a thermodynamic model of potentially adverse interactions between primer pairs, and then use the model to eliminate designs that are incompatible with the other designs in the pool.

In an embodiment, the invention features a method of decreasing the number of target loci (such as loci that may contain a polymorphism or mutation associated with a disease or disorder or an increased risk for a disease or disorder such as cancer) and/or increasing the disease load that is detected (e.g., increasing the number of polymorphisms or mutations that are detected). In some embodiments, the method includes ranking (such as ranking from highest to lowest) loci by frequency or reoccurrence of a polymorphism or mutation (such as a single nucleotide variation, insertion, or deletion, or any of the other variations described herein) in each locus among subjects with the disease or disorder such as cancer. In some embodiments, PCR primers are designed to some or all of the loci. During selection of PCR primers for a library of primers, primers to loci that have a higher frequency or reoccurrence (higher ranking loci) are favored over those with a lower frequency or reoccurrence (lower ranking loci). In some embodiments, this parameter is included as one of the parameters in the calculation of the undesirability scores described herein. If desired, primers (such as primers to high ranking loci) that are incompatible with other designs in the library can be included in a different PCR library/pool. In some embodiments, multiple libraries/pools (such as 2, 3, 4, 5 or more) are used in separate PCR reactions to enable amplification of all (or a majority) of the loci represented by all the libraries/pools. In some embodiments, this method is continued until sufficient primers are included in one or more libraries/pools such that the primers, in aggregate, enable the desired disease load to be captured for the disease or disorder (e.g., such as by detection of at least 80, 85, 90, 95, or 99% of the disease load).
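
As a hedged illustration of the ranking step, the sketch below ranks loci by the number of subjects carrying a mutation at each locus and adds loci in rank order until a target fraction of subjects is covered. Treating the captured disease load as the fraction of subjects with a mutation in at least one selected locus is an assumption made for this sketch, as are the locus names, the subject identifiers, and the function name.

```python
def select_loci_for_disease_load(subjects_by_locus, n_subjects, target_fraction):
    """Pick loci in order of mutation frequency until the target is reached.

    subjects_by_locus: dict mapping locus -> set of subject ids with a
    mutation at that locus (hypothetical input).
    Returns (selected loci in rank order, fraction of subjects covered).
    """
    # Rank from highest to lowest frequency; break ties by locus name
    ranked = sorted(subjects_by_locus,
                    key=lambda locus: (-len(subjects_by_locus[locus]), locus))
    chosen, covered = [], set()
    for locus in ranked:
        if len(covered) / n_subjects >= target_fraction:
            break
        chosen.append(locus)
        covered |= subjects_by_locus[locus]
    return chosen, len(covered) / n_subjects

# Made-up example data: ten subjects, four candidate loci
subjects_by_locus = {
    "L1": {1, 2, 3, 4, 5},
    "L2": {4, 5, 6, 7},
    "L3": {1, 8},
    "L4": {9},
}
loci, fraction = select_loci_for_disease_load(subjects_by_locus, 10, 0.8)
```

In this made-up example three of the four loci suffice to cover 80% of subjects, so the lowest ranking locus need not be targeted, which reduces the number of target loci while still capturing the desired load.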

Exemplary Primer Libraries

In one aspect, the invention features libraries of primers, such as primers selected from a library of candidate primers using any of the methods of the invention. In some embodiments, the library includes primers that simultaneously hybridize (or are capable of simultaneously hybridizing) to or that simultaneously amplify (or are capable of simultaneously amplifying) at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different target loci in one reaction volume. In various embodiments, the library includes primers that simultaneously amplify (or are capable of simultaneously amplifying) between 100 to 500; 500 to 1,000; 1,000 to 2,000; 2,000 to 5,000; 5,000 to 7,500; 7,500 to 10,000; 10,000 to 20,000; 20,000 to 25,000; 25,000 to 30,000; 30,000 to 40,000; 40,000 to 50,000; 50,000 to 75,000; or 75,000 to 100,000 different target loci in one reaction volume, inclusive. In various embodiments, the library includes primers that simultaneously amplify (or are capable of simultaneously amplifying) between 1,000 to 100,000 different target loci in one reaction volume, such as between 1,000 to 50,000; 1,000 to 30,000; 1,000 to 20,000; 1,000 to 10,000; 2,000 to 30,000; 2,000 to 20,000; 2,000 to 10,000; 5,000 to 30,000; 5,000 to 20,000; or 5,000 to 10,000 different target loci, inclusive. In some embodiments, the library includes primers that simultaneously amplify (or are capable of simultaneously amplifying) the target loci in one reaction volume such that less than 60, 40, 30, 20, 10, 5, 4, 3, 2, 1, 0.5, 0.25, 0.1, or 0.5% of the amplified products are primer dimers. In various embodiments, the amount of amplified products that are primer dimers is between 0.5 to 60%, such as between 0.1 to 40%, 0.1 to 20%, 0.25 to 20%, 0.25 to 10%, 0.5 to 20%, 0.5 to 10%, 1 to 20%, or 1 to 10%, inclusive. In some embodiments, the primers simultaneously amplify (or are capable of simultaneously amplifying) the target loci in one reaction volume such that at least 50, 60, 70, 80, 90, 95, 96, 97, 98, 99, or 99.5% of the amplified products are target amplicons. In various embodiments, the amount of amplified products that are target amplicons is between 50 to 99.5%, such as between 60 to 99%, 70 to 98%, 80 to 98%, 90 to 99.5%, or 95 to 99.5%, inclusive. In some embodiments, the primers simultaneously amplify (or are capable of simultaneously amplifying) the target loci in one reaction volume such that at least 50, 60, 70, 80, 90, 95, 96, 97, 98, 99, or 99.5% of the targeted loci are amplified (e.g., amplified at least 5, 10, 20, 30, 50, or 100-fold compared to the amount prior to amplification). In various embodiments, the amount of target loci that are amplified (e.g., amplified at least 5, 10, 20, 30, 50, or 100-fold compared to the amount prior to amplification) is between 50 to 99.5%, such as between 60 to 99%, 70 to 98%, 80 to 99%, 90 to 99.5%, 95 to 99.9%, or 98 to 99.99%, inclusive. In some embodiments, the library of primers includes at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 primer pairs, wherein each pair of primers includes a forward test primer and a reverse test primer where each pair of test primers hybridizes to a target locus. In some embodiments, the library of primers includes at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 individual primers that each hybridize to a different target locus, wherein the individual primers are not part of primer pairs.

In various embodiments, the concentration of each primer is less than 100, 75, 50, 25, 20, 10, 5, 2, or 1 nM, or less than 500, 100, 10, or 1 uM. In various embodiments, the concentration of each primer is between 1 uM to 100 nM, such as between 1 uM to 1 nM, 1 to 75 nM, 2 to 50 nM or 5 to 50 nM, inclusive. In various embodiments, the GC content of the primers is between 30 to 80%, such as between 40 to 70%, or 50 to 60%, inclusive. In some embodiments, the range of GC content of the primers is less than 30, 20, 10, or 5%. In some embodiments, the range of GC content of the primers is between 5 to 30%, such as 5 to 20% or 5 to 10%, inclusive. In some embodiments, the melting temperature (T_(m)) of the test primers is between 40 to 80° C., such as 50 to 70° C., 55 to 65° C., or 57 to 60.5° C., inclusive. In some embodiments, the T_(m) is calculated using the Primer3 program (libprimer3 release 2.2.3) using the built-in SantaLucia parameters (the world wide web at primer3.sourceforge.net). In some embodiments, the range of melting temperature of the primers is less than 15, 10, 5, 3, or 1° C. In some embodiments, the range of melting temperature of the primers is between 1 to 15° C., such as between 1 to 10° C., 1 to 5° C., or 1 to 3° C., inclusive. In some embodiments, the length of the primers is between 15 to 100 nucleotides, such as between 15 to 75 nucleotides, 15 to 40 nucleotides, 17 to 35 nucleotides, 18 to 30 nucleotides, or 20 to 65 nucleotides, inclusive. In some embodiments, the range of the length of the primers is less than 50, 40, 30, 20, 10, or 5 nucleotides. In some embodiments, the range of the length of the primers is between 5 to 50 nucleotides, such as 5 to 40 nucleotides, 5 to 20 nucleotides, or 5 to 10 nucleotides, inclusive. In some embodiments, the length of the target amplicons is between 50 and 100 nucleotides, such as between 60 and 80 nucleotides, or 60 to 75 nucleotides, inclusive. In some embodiments, the range of the length of the target amplicons is less than 50, 25, 15, 10, or 5 nucleotides. In some embodiments, the range of the length of the target amplicons is between 5 to 50 nucleotides, such as 5 to 25 nucleotides, 5 to 15 nucleotides, or 5 to 10 nucleotides, inclusive. In some embodiments, the library does not comprise a microarray. In some embodiments, the library comprises a microarray.
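As a non-limiting illustration, the GC content and length criteria above can be checked for a candidate library with a short Python sketch. The function names and the default limits (taken from the broadest ranges recited above) are assumptions made for this example; melting temperature is omitted because it depends on an external thermodynamic model such as Primer3.

```python
from typing import Dict, Sequence, Tuple


def gc_content(seq: str) -> float:
    """Return the percentage of G and C bases in a primer sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def library_spread(primers: Sequence[str]) -> Dict[str, float]:
    """Summarize GC content and length, and their ranges, across a library."""
    gcs = [gc_content(p) for p in primers]
    lengths = [len(p) for p in primers]
    return {
        "gc_min": min(gcs),
        "gc_max": max(gcs),
        "gc_range": max(gcs) - min(gcs),
        "len_min": min(lengths),
        "len_max": max(lengths),
        "len_range": max(lengths) - min(lengths),
    }


def within_limits(
    primers: Sequence[str],
    gc_bounds: Tuple[float, float] = (30.0, 80.0),   # percent, inclusive
    max_gc_range: float = 30.0,                      # percent, exclusive
    len_bounds: Tuple[int, int] = (15, 100),         # nucleotides, inclusive
    max_len_range: int = 50,                         # nucleotides, exclusive
) -> bool:
    """Check a library against illustrative GC and length limits."""
    s = library_spread(primers)
    return (
        gc_bounds[0] <= s["gc_min"]
        and s["gc_max"] <= gc_bounds[1]
        and s["gc_range"] < max_gc_range
        and len_bounds[0] <= s["len_min"]
        and s["len_max"] <= len_bounds[1]
        and s["len_range"] < max_len_range
    )
```

Tighter embodiments (for example, a GC content range of less than 10%) can be tested by passing narrower limits to `within_limits`.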

In some embodiments, some (such as at least 80, 90, or 95%) or all of the adaptors or primers include one or more linkages between adjacent nucleotides other than a naturally-occurring phosphodiester linkage. Examples of such linkages include phosphoramide, phosphorothioate, and phosphorodithioate linkages. In some embodiments, some (such as at least 80, 90, or 95%) or all of the adaptors or primers include a thiophosphate (such as a monothiophosphate) between the last 3′ nucleotide and the second to last 3′ nucleotide. In some embodiments, some (such as at least 80, 90, or 95%) or all of the adaptors or primers include a thiophosphate (such as a monothiophosphate) between the last 2, 3, 4, or 5 nucleotides at the 3′ end. In some embodiments, some (such as at least 80, 90, or 95%) or all of the adaptors or primers include a thiophosphate (such as a monothiophosphate) between at least 1, 2, 3, 4, or 5 nucleotides out of the last 10 nucleotides at the 3′ end. In some embodiments, such primers are less likely to be cleaved or degraded. In some embodiments, the primers do not contain an enzyme cleavage site (such as a protease cleavage site).

Additional exemplary multiplex PCR methods and libraries are described in U.S. application Ser. No. 13/683,604, filed Nov. 21, 2012 (U.S. Publication No. 2013/0123120) and U.S. Ser. No. 61/994,791, filed May 16, 2014, which are each hereby incorporated by reference in its entirety. These methods and libraries can be used for analysis of any of the samples disclosed herein and for use in any of the methods of the invention.

Exemplary Primer Libraries for Detection of Recombination

In some embodiments, primers in the primer library are designed to determine whether or not recombination occurred at one or more known recombination hotspots (such as crossovers between homologous human chromosomes). Knowing what crossovers occurred between chromosomes allows more accurate phased genetic data to be determined for an individual. Recombination hotspots are local regions of chromosomes in which recombination events tend to be concentrated. Often they are flanked by “coldspots,” regions of lower than average frequency of recombination. Recombination hotspots tend to share a similar morphology and are approximately 1 to 2 kb in length. The hotspot distribution is positively correlated with GC content and repetitive element distribution. A partially degenerated 13-mer motif CCNCCNTNNCCNC plays a role in some hotspot activity. It has been shown that the zinc finger protein called PRDM9 binds to this motif and initiates recombination at its location. The average distance between the centers of recombination hot spots is reported to be ~80 kb. In some embodiments, the distance between the centers of recombination hot spots ranges between ~3 kb to ~100 kb. Public databases include a large number of known human recombination hotspots, such as the HUMHOT and International HapMap Project databases (see, for example, Nishant et al., “HUMHOT: a database of human meiotic recombination hot spots,” Nucleic Acids Research, 34: D25-D28, 2006, Database issue; Mackiewicz et al., “Distribution of Recombination Hotspots in the Human Genome—A Comparison of Computer Simulations with Real Data” PLoS ONE 8(6): e65272, doi:10.1371/journal.pone.0065272; and the world wide web at hapmap.ncbi.nlm.nih.gov/downloads/index.html.en, which are each hereby incorporated by reference in its entirety).
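For illustration only, occurrences of the degenerate 13-mer motif CCNCCNTNNCCNC in a sequence can be located with a regular expression, where N is read as any of A, C, G, or T. The function name `find_motif` is hypothetical, and the sketch scans only the strand it is given; searching the reverse complement as well would be a natural extension.

```python
import re
from typing import List

MOTIF = "CCNCCNTNNCCNC"

# Expand each N into a character class and wrap the pattern in a lookahead
# so that overlapping occurrences are also reported.
PATTERN = re.compile("(?=(" + MOTIF.replace("N", "[ACGT]") + "))")


def find_motif(sequence: str) -> List[int]:
    """Return 0-based start positions of the motif on the given strand."""
    return [match.start() for match in PATTERN.finditer(sequence.upper())]
```

Positions returned by such a scan could serve as candidate sites for comparison against known hotspots from the public databases mentioned above.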

In some embodiments, primers in the primer library are clustered at or near recombination hotspots (such as known human recombination hotspots). In some embodiments, the corresponding amplicons are used to determine the sequence within or near a recombination hotspot to determine whether or not recombination occurred at that particular hotspot (such as whether the sequence of the amplicon is the sequence expected if a recombination had occurred or the sequence expected if a recombination had not occurred). In some embodiments, primers are designed to amplify part or all of a recombination hotspot (and optionally sequence flanking a recombination hotspot). In some embodiments, long read sequencing (such as sequencing using the Moleculo Technology developed by Illumina to sequence up to ~10 kb) or paired end sequencing is used to sequence part or all of a recombination hotspot. Knowledge of whether or not a recombination event occurred can be used to determine which haplotype blocks flank the hotspot. If desired, the presence of particular haplotype blocks can be confirmed using primers specific to regions within the haplotype blocks. In some embodiments, it is assumed there are no crossovers between known recombination hotspots. In some embodiments, primers in the primer library are clustered at or near the ends of chromosomes. For example, such primers can be used to determine whether or not a particular arm or section at the end of a chromosome is present. In some embodiments, primers in the primer library are clustered at or near recombination hotspots and at or near the ends of chromosomes.

In some embodiments, the primer library includes one or more primers (such as at least 5; 10; 50; 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; or 50,000 different primers or different primer pairs) that are specific for a recombination hotspot (such as a known human recombination hotspot) and/or are specific for a region near a recombination hotspot (such as within 10, 8, 5, 3, 2, 1, or 0.5 kb of the 5′ or 3′ end of a recombination hotspot). In some embodiments, at least 1, 5, 10, 20, 40, 60, 80, 100, or 150 different primers (or primer pairs) are specific for the same recombination hotspot, or are specific for the same recombination hotspot or a region near the recombination hotspot. In some embodiments, at least 1, 5, 10, 20, 40, 60, 80, 100, or 150 different primers (or primer pairs) are specific for a region between recombination hotspots (such as a region unlikely to have undergone recombination); these primers can be used to confirm the presence of haplotype blocks (such as those that would be expected depending on whether or not recombination has occurred). In some embodiments, at least 10, 20, 30, 40, 50, 60, 70, 80, or 90% of the primers in the primer library are specific for a recombination hotspot and/or are specific for a region near a recombination hotspot (such as within 10, 8, 5, 3, 2, 1, or 0.5 kb of the 5′ or 3′ end of the recombination hotspot). In some embodiments, the primer library is used to determine whether or not recombination has occurred at greater than or equal to 5; 10; 50; 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; or 50,000 different recombination hotspots (such as known human recombination hotspots). In some embodiments, the regions targeted by primers to a recombination hotspot or nearby region are approximately evenly spread out along that portion of the genome. In some embodiments, at least 1, 5, 10, 20, 40, 60, 80, 100, or 150 different primers (or primer pairs) are specific for a region at or near the end of a chromosome (such as a region within 20, 10, 5, 1, 0.5, 0.1, 0.01, or 0.001 Mb from the end of a chromosome). In some embodiments, at least 10, 20, 30, 40, 50, 60, 70, 80, or 90% of the primers in the primer library are specific for a region at or near the end of a chromosome (such as a region within 20, 10, 5, 1, 0.5, 0.1, 0.01, or 0.001 Mb from the end of a chromosome). In some embodiments, at least 1, 5, 10, 20, 40, 60, 80, 100, or 150 different primers (or primer pairs) are specific for a region within a potential microdeletion in a chromosome. In some embodiments, at least 10, 20, 30, 40, 50, 60, 70, 80, or 90% of the primers in the primer library are specific for a region within a potential microdeletion in a chromosome. In some embodiments, at least 10, 20, 30, 40, 50, 60, 70, 80, or 90% of the primers in the primer library are specific for a recombination hotspot, a region near a recombination hotspot, a region at or near the end of a chromosome, or a region within a potential microdeletion in a chromosome.

Exemplary Kits

In one aspect, the invention features a kit, such as a kit for amplifying target loci in a nucleic acid sample for detecting deletions and/or duplications of chromosome segments or entire chromosomes using any of the methods described herein. In some embodiments, the kit can include any of the primer libraries of the invention. In an embodiment, the kit comprises a plurality of inner forward primers and optionally a plurality of inner reverse primers, and optionally outer forward primers and outer reverse primers, where each of the primers is designed to hybridize to the region of DNA immediately upstream and/or downstream from one of the target sites (e.g., polymorphic sites) on the target chromosome(s) or chromosome segment(s), and optionally additional chromosomes or chromosome segments. In some embodiments, the kit includes instructions for using the primer library to amplify the target loci, such as for detecting one or more deletions and/or duplications of one or more chromosome segments or entire chromosomes using any of the methods described herein.

In certain embodiments, kits of the invention provide primer pairs for detecting chromosomal aneuploidy and CNV determination, such as primer pairs for massively multiplex reactions for detecting chromosomal aneuploidy such as CNV (CoNVERGe) (Copy Number Variant Events Revealed Genotypically) and/or SNVs. In these embodiments, the kits can include between at least 100, 200, 250, 300, 500, 1000, 2000, 2500, 3000, 5000, 10,000, 20,000, 25,000, 28,000, 50,000, or 75,000 and at most 200, 250, 300, 500, 1000, 2000, 2500, 3000, 5000, 10,000, 20,000, 25,000, 28,000, 50,000, 75,000, or 100,000 primer pairs that are shipped together. The primer pairs can be contained in a single vessel, such as a single tube or box, or multiple tubes or boxes. In certain embodiments, the primer pairs are pre-qualified by a commercial provider and sold together, and in other embodiments, a customer selects custom gene targets and/or primers and a commercial provider makes and ships the primer pool to the customer either in one tube or in a plurality of tubes. In certain exemplary embodiments, the kits include primers for detecting both CNVs and SNVs, especially CNVs and SNVs known to be correlated to at least one type of cancer.

Kits for circulating DNA detection according to some embodiments of the present invention include standards and/or controls for circulating DNA detection. For example, in certain embodiments, the standards and/or controls are sold and optionally shipped and packaged together with primers used to perform the amplification reactions provided herein, such as primers for performing CoNVERGe. In certain embodiments, the controls include polynucleotides such as DNA, including isolated genomic DNA that exhibits one or more chromosomal aneuploidies such as CNV and/or includes one or more SNVs. In certain embodiments, the standards and/or controls are called PlasmArt standards and include polynucleotides having sequence identity to regions of the genome known to exhibit CNV, especially in certain inherited diseases, and in certain disease states such as cancer, as well as a size distribution that reflects that of cfDNA fragments naturally found in plasma. Exemplary methods for making PlasmArt standards are provided in the examples herein. In general, genomic DNA from a source known to include a chromosomal aneuploidy is isolated, fragmented, purified and size selected.

Accordingly, artificial cfDNA polynucleotide standards and/or controls can be made by spiking isolated polynucleotide samples prepared as summarized above, into DNA samples known not to exhibit a chromosomal aneuploidy and/or SNVs, at concentrations similar to those observed for cfDNA in vivo, such as between, for example, 0.01% and 20%, 0.1 and 15%, or 0.4 and 10% of DNA in that fluid. These standards/controls can be used as controls for assay design, characterization, development, and/or validation, and as quality control standards during testing, such as cancer testing performed in a CLIA lab and/or as standards included in research use only or diagnostic test kits.
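As a simple worked illustration of the spiking arithmetic, the mass of aneuploid DNA needed to reach a chosen fraction of the final mixture follows from fraction = spike / (spike + background). The function name and the use of nanograms are assumptions made for this example.

```python
def spike_mass_ng(background_ng: float, target_fraction: float) -> float:
    """Mass of spiked DNA giving target_fraction of the total DNA mass.

    Solves target_fraction = spike / (spike + background_ng) for spike.
    target_fraction is a proportion, e.g. 0.01 for 1%.
    """
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    return background_ng * target_fraction / (1.0 - target_fraction)
```

For example, spiking into 99 ng of background DNA to reach a 1% fraction calls for about 1 ng of the fragmented, size-selected aneuploid DNA.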

Exemplary Normalization/Correction Methods

In some embodiments, measurements for different loci, chromosome segments, or chromosomes are adjusted for bias, such as bias due to differences in GC content or bias due to other differences in amplification efficiency or adjusted for sequencing errors. In some embodiments, measurements for different alleles for the same locus are adjusted for differences in metabolism, apoptosis, histones, inactivation, and/or amplification between the alleles. In some embodiments, measurements for different alleles for the same locus in RNA are adjusted for differences in transcription rates or stability between different RNA alleles.

Exemplary Methods for Phasing Genetic Data

In some embodiments, genetic data is phased using the methods described herein or any known method for phasing genetic data (see, e.g., PCT Publ. No. WO2009/105531, filed Feb. 9, 2009, and PCT Publ. No. WO2010/017214, filed Aug. 4, 2009; U.S. Publ. No. 2013/0123120, filed Nov. 21, 2012; U.S. Publ. No. 2011/0033862, filed Oct. 7, 2010; U.S. Publ. No. 2011/0033862, filed Aug. 19, 2010; U.S. Publ. No. 2011/0178719, filed Feb. 3, 2011; U.S. Pat. No. 8,515,679, filed Mar. 17, 2008; U.S. Publ. No. 2007/0184467, filed Nov. 22, 2006; U.S. Publ. No. 2008/0243398, filed Mar. 17, 2008, and U.S. Ser. No. 61/994,791, filed May 16, 2014, which are each hereby incorporated by reference in its entirety). In some embodiments, the phase is determined for one or more regions that are known or suspected to contain a CNV of interest. In some embodiments, the phase is also determined for one or more regions flanking the CNV region(s) and/or for one or more reference regions. In one embodiment, genetic data of an individual (e.g., an individual being tested using the methods of the invention or a relative of a gestating fetus or embryo, such as a parent of the fetus or embryo) is phased by inference by measuring tissue from the individual that is haploid, for example by measuring one or more sperm or eggs. In one embodiment, an individual's genetic data is phased by inference using the measured genotypic data of one or more first degree relatives, such as the individual's parents (e.g., sperm from the individual's father) or siblings.
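The idea of phasing by inference from parental genotypes can be illustrated, for a single locus, with the following Python sketch. The function name and the tuple representation of genotypes are assumptions introduced for this example; it applies simple Mendelian logic and does not use linkage between loci, which a practical method would.

```python
from typing import Optional, Tuple

Genotype = Tuple[str, str]


def phase_child(
    child: Genotype, mother: Genotype, father: Genotype
) -> Optional[Tuple[str, str]]:
    """Return (maternal_allele, paternal_allele) for the child at one locus.

    Returns None when the parental genotypes do not resolve the phase, or
    when the heterozygous child genotype is inconsistent with the parents.
    """
    a, b = child
    if a == b:
        # Homozygous loci carry no phase information to resolve.
        return (a, b)
    options = []
    for maternal, paternal in ((a, b), (b, a)):
        if maternal in mother and paternal in father:
            options.append((maternal, paternal))
    return options[0] if len(options) == 1 else None
```

A locus at which the child and both parents are all heterozygous remains unresolved in this sketch, which is one reason haploid measurements (such as sperm) or additional relatives can be informative.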

In one embodiment, an individual's genetic data is phased by dilution where the DNA or RNA is diluted in one or a plurality of wells, such as by using digital PCR. In some embodiments, the DNA or RNA is diluted to the point where there is expected to be no more than approximately one copy of each haplotype in each well, and then the DNA or RNA in the one or more wells is measured. In some embodiments, cells are arrested at a phase of mitosis when chromosomes are tight bundles, and microfluidics is used to put separate chromosomes in separate wells. Because the DNA or RNA is diluted, it is unlikely that more than one haplotype is in the same fraction (or tube). Thus, there may be effectively a single molecule of DNA in the tube, which allows the haplotype on a single DNA or RNA molecule to be determined. In some embodiments, the method includes dividing a DNA or RNA sample into a plurality of fractions such that at least one of the fractions includes one chromosome or one chromosome segment from a pair of chromosomes, and genotyping (e.g., determining the presence of two or more polymorphic loci) the DNA or RNA sample in at least one of the fractions, thereby determining a haplotype. In some embodiments, the genotyping involves sequencing (such as shotgun sequencing or single molecule sequencing), a SNP array to detect polymorphic loci, or multiplex PCR. In some embodiments, the genotyping involves use of a SNP array to detect polymorphic loci, such as at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different polymorphic loci. In some embodiments, the genotyping involves the use of multiplex PCR. In some embodiments, the method involves contacting the sample in a fraction with a library of primers that simultaneously hybridize to at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different polymorphic loci (such as SNPs) to produce a reaction mixture; and subjecting the reaction mixture to primer extension reaction conditions to produce amplified products that are measured with a high throughput sequencer to produce sequencing data. In some embodiments, RNA (such as mRNA) is sequenced. Since mRNA contains only exons, sequencing mRNA allows alleles to be determined for polymorphic loci (such as SNPs) over a large distance in the genome, such as a few megabases. In some embodiments, a haplotype of an individual is determined by chromosome sorting. An exemplary chromosome sorting method includes arresting cells at a phase of mitosis when chromosomes are tight bundles and using microfluidics to put separate chromosomes in separate wells. Another method involves collecting single chromosomes using FACS-mediated single chromosome sorting. Standard methods (such as sequencing or an array) can be used to identify the alleles on a single chromosome to determine a haplotype of the individual.
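The statement that dilution makes it unlikely for more than one haplotype to share a fraction can be made concrete under a simple model. The sketch below assumes, purely for illustration, that the number of molecules of each of the two haplotypes covering a locus in a well is independently Poisson distributed with equal means; the function name and this model are not taken from the specification.

```python
import math


def mixed_well_probability(mean_copies_per_well: float) -> float:
    """Probability that a non-empty well holds both haplotypes of a locus.

    Assumes each haplotype is independently Poisson distributed with mean
    mean_copies_per_well / 2, and conditions on the well containing at
    least one molecule covering the locus.
    """
    half = mean_copies_per_well / 2.0
    p_haplotype_present = 1.0 - math.exp(-half)
    p_any_present = 1.0 - math.exp(-mean_copies_per_well)
    return p_haplotype_present ** 2 / p_any_present
```

Under these assumptions the probability of a mixed well falls as the mean number of copies per well decreases, which is the rationale for diluting before genotyping each fraction.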

In some embodiments, a haplotype of an individual is determined by long read sequencing, such as by using the Moleculo Technology developed by Illumina. In some embodiments, the library prep step involves shearing DNA into fragments, such as fragments of ~10 kb size, diluting the fragments and placing them into wells (such that about 3,000 fragments are in a single well), amplifying fragments in each well by long-range PCR and cutting into short fragments and barcoding the fragments, and pooling the barcoded fragments from each well together to sequence them all. After sequencing, the computational steps involve separating the reads from each well based on the attached barcodes and grouping them into fragments, assembling the fragments at their overlapping heterozygous SNVs into haplotype blocks, and phasing the blocks statistically based on a phased reference panel and producing long haplotype contigs.
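A much simplified sketch of the first two computational steps above (separating reads by barcode, and joining fragments at shared heterozygous SNVs) is given below. The data representations, function names, and the greedy majority-vote joining rule are assumptions made for illustration only; the statistical phasing of blocks against a reference panel is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_barcode(reads: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Separate (barcode, read) pairs into per-well lists of reads."""
    wells: Dict[str, List[str]] = defaultdict(list)
    for barcode, read in reads:
        wells[barcode].append(read)
    return dict(wells)


def assemble_blocks(fragments: List[Dict[int, int]]) -> List[Dict[int, int]]:
    """Join fragments that share heterozygous SNVs into haplotype blocks.

    Each fragment maps a SNV position to the allele (0 or 1) it carries.
    A fragment overlapping the current block is added as-is when it mostly
    agrees with the block, or with alleles flipped when it mostly disagrees
    (meaning it came from the other homolog). A fragment with no shared SNV
    starts a new block.
    """
    blocks: List[Dict[int, int]] = []
    for frag in sorted(fragments, key=lambda f: min(f)):
        block = blocks[-1] if blocks else None
        shared = [p for p in frag if block is not None and p in block]
        if not shared:
            blocks.append(dict(frag))
            continue
        agree = sum(frag[p] == block[p] for p in shared)
        flip = agree * 2 < len(shared)
        for position, allele in frag.items():
            block.setdefault(position, 1 - allele if flip else allele)
    return blocks
```

Each returned block records the alleles of one homolog; the other homolog carries the complementary allele at every position in the block.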

In some embodiments, a haplotype of the individual is determined using data from a relative of the individual. In some embodiments, a SNP array is used to determine the presence of at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different polymorphic loci in a DNA or RNA sample from the individual and a relative of the individual. In some embodiments, the method involves contacting a DNA sample from the individual and/or a relative of the individual with a library of primers that simultaneously hybridize to at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different polymorphic loci (such as SNPs) to produce a reaction mixture; and subjecting the reaction mixture to primer extension reaction conditions to produce amplified products that are measured with a high throughput sequencer to produce sequencing data.

In one embodiment, an individual's genetic data is phased using a computer program that uses population based haplotype frequencies to infer the most likely phase, such as HapMap-based phasing. For example, haploid data sets can be deduced directly from diploid data using statistical methods that utilize known haplotype blocks in the general population (such as those created for the public HapMap Project and for the Perlegen Human Haplotype Project). A haplotype block is essentially a series of correlated alleles that occur repeatedly in a variety of populations. Since these haplotype blocks are often ancient and common, they may be used to predict haplotypes from diploid genotypes. Publicly available algorithms that accomplish this task include an imperfect phylogeny approach, Bayesian approaches based on conjugate priors, and priors from population genetics. Some of these algorithms use a hidden Markov model.

In one embodiment, an individual's genetic data is phased using an algorithm that estimates haplotypes from genotype data, such as an algorithm that uses localized haplotype clustering (see, e.g., Browning and Browning, “Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering” Am J Hum Genet. November 2007; 81(5): 1084-1097, which is hereby incorporated by reference in its entirety). An exemplary program is Beagle version 3.3.2 or version 4 (available at the world wide web at hfaculty.washington.edu/browning/beagle/beagle.html, which is hereby incorporated by reference in its entirety).

In one embodiment, an individual's genetic data is phased using an algorithm that estimates haplotypes from genotype data, such as an algorithm that uses the decay of linkage disequilibrium with distance, the order and spacing of genotyped markers, missing-data imputation, recombination rate estimates, or a combination thereof (see, e.g., Stephens and Scheet, “Accounting for Decay of Linkage Disequilibrium in Haplotype Inference and Missing-Data Imputation” Am. J. Hum. Genet. 76:449-462, 2005, which is hereby incorporated by reference in its entirety). An exemplary program is PHASE v2.1 or v2.1.1 (available at the world wide web at stephenslab.uchicago.edu/software.html, which is hereby incorporated by reference in its entirety).

In one embodiment, an individual's genetic data is phased using an algorithm that estimates haplotypes from population genotype data, such as an algorithm that allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both “block-like” patterns of linkage disequilibrium and gradual decline in linkage disequilibrium with distance (see, e.g., Scheet and Stephens, “A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.” Am J Hum Genet, 78:629-644, 2006, which is hereby incorporated by reference in its entirety). An exemplary program is fastPHASE (available at the world wide web at stephenslab.uchicago.edu/software.html, which is hereby incorporated by reference in its entirety).

In one embodiment, an individual's genetic data is phased using a genotype imputation method, such as a method that uses one or more of the following reference datasets: HapMap dataset, datasets of controls genotyped on multiple SNP chips, and densely typed samples from the 1,000 Genomes Project. An exemplary approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels (see, e.g., Howie, Donnelly, and Marchini, “A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.” PLoS Genetics 5(6): e1000529, 2009, which is hereby incorporated by reference in its entirety). Exemplary programs are IMPUTE or IMPUTE version 2 (also known as IMPUTE2) (available at the world wide web at mathgen.stats.ox.ac.uk/impute/impute_v2.html, which is hereby incorporated by reference in its entirety).

In one embodiment, an individual's genetic data is phased using an algorithm that infers haplotypes, such as an algorithm that infers haplotypes under the genetic model of coalescence with recombination, such as that developed by Stephens in PHASE v2.1. The major algorithmic improvements rely on the use of binary trees to represent the sets of candidate haplotypes for each individual. These binary tree representations: (1) speed up the computations of posterior probabilities of the haplotypes by avoiding the redundant operations made in PHASE v2.1, and (2) overcome the exponential aspect of the haplotype inference problem by the smart exploration of the most plausible pathways (i.e., haplotypes) in the binary trees (see, e.g., Delaneau, Coulonges and Zagury, “Shape-IT: new rapid and accurate algorithm for haplotype inference,” BMC Bioinformatics 9:540, 2008, doi:10.1186/1471-2105-9-540, which is hereby incorporated by reference in its entirety). An exemplary program is SHAPEIT (available at the world wide web at mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html, which is hereby incorporated by reference in its entirety).

In one embodiment, an individual's genetic data is phased using an algorithm that estimates haplotypes from population genotype data, such as an algorithm that uses haplotype-fragment frequencies to obtain empirically based probabilities for longer haplotypes. In some embodiments, the algorithm reconstructs haplotypes so that they have maximal local coherence (see, e.g., Eronen, Geerts, and Toivonen, “HaploRec: Efficient and accurate large-scale reconstruction of haplotypes,” BMC Bioinformatics 7:542, 2006, which is hereby incorporated by reference in its entirety). An exemplary program is HaploRec, such as HaploRec version 2.3 (available at the world wide web at cs.helsinki.fi/group/genetics/haplotyping.html, which is hereby incorporated by reference in its entirety).

In one embodiment, an individual's genetic data is phased using an algorithm that estimates haplotypes from population genotype data, such as an algorithm that uses a partition-ligation strategy and an expectation-maximization-based algorithm (see, e.g., Qin, Niu, and Liu, “Partition-Ligation-Expectation-Maximization Algorithm for Haplotype Inference with Single-Nucleotide Polymorphisms,” Am J Hum Genet. 71(5): 1242-1247, 2002, which is hereby incorporated by reference in its entirety). An exemplary program is PL-EM (available at the world wide web at people.fas.harvard.edu/~junliu/plem/click.html, which is hereby incorporated by reference in its entirety).

In one embodiment, an individual's genetic data is phased using an algorithm that estimates haplotypes from population genotype data, such as an algorithm for simultaneously phasing genotypes into haplotypes and block partitioning. In some embodiments, an expectation-maximization algorithm is used (see, e.g., Kimmel and Shamir, “GERBIL: Genotype Resolution and Block Identification Using Likelihood,” Proceedings of the National Academy of Sciences of the United States of America (PNAS) 102: 158-162, 2005, which is hereby incorporated by reference in its entirety). An exemplary program is GERBIL, which is available as part of the GEVALT version 2 program (available at the world wide web at acgt.cs.tau.ac.il/gevalt/, which is hereby incorporated by reference in its entirety).

In one embodiment, an individual's genetic data is phased using an algorithm that estimates haplotypes from population genotype data, such as an algorithm that uses an EM algorithm to calculate ML estimates of haplotype frequencies given genotype measurements which do not specify phase. The algorithm also allows for some genotype measurements to be missing (due, for example, to PCR failure). It also allows multiple imputation of individual haplotypes (see, e.g., Clayton, D. (2002), “SNPHAP: A Program for Estimating Frequencies of Large Haplotypes of SNPs”, which is hereby incorporated by reference in its entirety). An exemplary program is SNPHAP (available at the world wide web at gene.cimr.cam.ac.uk/clayton/software/snphap.txt, which is hereby incorporated by reference in its entirety).

In one embodiment, an individual's genetic data is phased using an algorithm that estimates haplotypes from population genotype data, such as an algorithm for haplotype inference based on genotype statistics collected for pairs of SNPs. This software can be used for comparatively accurate phasing of a large number of long genome sequences, e.g., obtained from DNA arrays. An exemplary program takes a genotype matrix as input and outputs the corresponding haplotype matrix (see, e.g., Brinza and Zelikovsky, “2SNP: scalable phasing based on 2-SNP haplotypes,” Bioinformatics. 22(3):371-3, 2006, which is hereby incorporated by reference in its entirety). An exemplary program is 2SNP (available at the world wide web at alla.cs.gsu.edu/~software/2SNP, which is hereby incorporated by reference in its entirety).

In various embodiments, an individual's genetic data is phased using data about the probability of chromosomes crossing over at different locations in a chromosome or chromosome segment (such as using recombination data such as may be found in the HapMap database to create a recombination risk score for any interval) to model dependence between polymorphic alleles on the chromosome or chromosome segment. In some embodiments, allele counts at the polymorphic loci are calculated on a computer based on sequencing data or SNP array data. In some embodiments, a plurality of hypotheses each pertaining to a different possible state of the chromosome or chromosome segment (such as an overrepresentation of the number of copies of a first homologous chromosome segment as compared to a second homologous chromosome segment in the genome of one or more cells from an individual, a duplication of the first homologous chromosome segment, a deletion of the second homologous chromosome segment, or an equal representation of the first and second homologous chromosome segments) are created (such as creation on a computer); a model (such as a joint distribution model) for the expected allele counts at the polymorphic loci on the chromosome is built (such as building on a computer) for each hypothesis; a relative probability of each of the hypotheses is determined (such as determination on a computer) using the joint distribution model and the allele counts; and the hypothesis with the greatest probability is selected. In some embodiments, building a joint distribution model for allele counts and the step of determining the relative probability of each hypothesis are done using a method that does not require the use of a reference chromosome.

In one embodiment, genetic data of an individual is phased using genetic data of one or more relatives of the individual (such as one or more parents, siblings, children, fetuses, embryos, grandparents, uncles, aunts, or cousins). In one embodiment, genetic data of an individual is phased using genetic data of one or more genetic offspring of the individual (e.g., 1, 2, 3, or more offspring), such as embryos, fetuses, born children, or a sample of a miscarriage. In one embodiment, genetic data of a parent (such as a parent of a gestating fetus or embryo) is phased using phased haplotypic data for the other parent along with unphased genetic data of one or more genetic offspring of the parents.

In some embodiments, a sample (e.g., a biopsy such as a tumor biopsy, blood sample, plasma sample, serum sample, or another sample likely to contain mostly or only cells, DNA, or RNA with a CNV of interest) from the individual (such as an individual suspected of having cancer, a fetus, or an embryo) is analyzed to determine the phase for one or more regions that are known or suspected to contain a CNV of interest (such as a deletion or duplication). In some embodiments, the sample has a high tumor fraction (such as 30, 40, 50, 60, 70, 80, 90, 95, 98, 99, or 100%). In some embodiments, a sample (e.g., a maternal whole blood sample, cells isolated from a maternal blood sample, maternal plasma sample, maternal serum sample, amniocentesis sample, placental tissue sample (e.g., chorionic villus, decidua, or placental membrane), cervical mucus sample, fetal tissue after fetal demise, other sample from a fetus, or another sample likely to contain mostly or only cells, DNA, or RNA with a CNV of interest) from a fetus or the pregnant mother of a fetus is analyzed to determine the phase for one or more regions that are known or suspected to contain a CNV of interest (such as a deletion or duplication). In some embodiments, the sample has a high fetal fraction (such as 25, 30, 40, 50, 60, 70, 80, 90, 95, 98, 99, or 100%).

In some embodiments, the sample has a haplotypic imbalance or any aneuploidy. In some embodiments, the sample includes any mixture of two types of DNA where the two types have different ratios of the two haplotypes, and share at least one haplotype. For example, in the fetal-maternal case, the mother is 1:1 and the fetus is 1:0 (plus a paternal haplotype). For example, in the tumor case, the normal tissue is 1:1, and the tumor tissue is 1:0 or 1:2, 1:3, 1:4, etc. In some embodiments, at least 10; 100; 500; 1,000; 2,000; 3,000; 5,000; 8,000; or 10,000 polymorphic loci are analyzed to determine the phase of alleles at some or all of the loci. In some embodiments, a sample is from a cell or tissue that was treated to become aneuploid, such as aneuploidy induced by prolonged cell culture.

In some embodiments, a large percent or all of the DNA or RNA in the sample has the CNV of interest. In some embodiments, the ratio of DNA or RNA from the one or more target cells that contain the CNV of interest to the total DNA or RNA in the sample is at least 80, 85, 90, 95, or 100%. For samples with a deletion, only one haplotype is present for the cells (or DNA or RNA) with the deletion. This first haplotype can be determined using standard methods to determine the identity of alleles present in the region of the deletion. In samples that only contain cells (or DNA or RNA) with the deletion, there will only be signal from the first haplotype that is present in those cells. In samples that also contain a small amount of cells (or DNA or RNA) without the deletion (such as a small amount of noncancerous cells), the weak signal from the second haplotype in these cells (or DNA or RNA) can be ignored. The second haplotype that is present in other cells, DNA, or RNA from the individual that lack the deletion can be determined by inference. For example, if the genotype of cells from the individual without the deletion is (AB,AB) and the phased data for the individual indicates that the first haplotype is (A,A), then the other haplotype can be inferred to be (B,B).

For samples in which both cells (or DNA or RNA) with a deletion and cells (or DNA or RNA) without a deletion are present, the phase can still be determined. For example, plots can be generated similar to FIG. 18 or 29 in which the x-axis represents the linear position of the individual loci along the chromosome, and the y-axis represents the number of A allele reads as a fraction of the total (A+B) allele reads. In some embodiments for a deletion, the pattern includes two central bands that represent SNPs for which the individual is heterozygous (top band represents AB from cells without the deletion and A from cells with the deletion, and bottom band represents AB from cells without the deletion and B from cells with the deletion). In some embodiments, the separation of these two bands increases as the fraction of cells, DNA, or RNA with the deletion increases. Thus, the identity of the A alleles can be used to determine the first haplotype, and the identity of the B alleles can be used to determine the second haplotype.

For samples with a duplication, an extra copy of the haplotype is present for the cells (or DNA or RNA) with the duplication. The haplotype of the duplicated region can be determined using standard methods to determine the identity of alleles present at an increased amount in the region of the duplication, or the haplotype of the region that is not duplicated can be determined using standard methods to determine the identity of alleles present at a decreased amount. Once one haplotype is determined, the other haplotype can be determined by inference.

For samples in which both cells (or DNA or RNA) with a duplication and cells (or DNA or RNA) without a duplication are present, the phase can still be determined using a method similar to that described above for deletions. For example, plots can be generated similar to FIG. 18 or 29 in which the x-axis represents the linear position of the individual loci along the chromosome, and the y-axis represents the number of A allele reads as a fraction of the total (A+B) allele reads. In some embodiments for a duplication, the pattern includes two central bands that represent SNPs for which the individual is heterozygous (top band represents AB from cells without the duplication and AAB from cells with the duplication, and bottom band represents AB from cells without the duplication and ABB from cells with the duplication). In some embodiments, the separation of these two bands increases as the fraction of cells, DNA, or RNA with the duplication increases. Thus, the identity of the A alleles can be used to determine the first haplotype, and the identity of the B alleles can be used to determine the second haplotype.

In some embodiments, the phase of one or more CNV region(s) (such as the phase of at least 50, 60, 70, 80, 90, 95, or 100% of the polymorphic loci in the region that were measured) is determined for a sample (such as a tumor biopsy or plasma sample) from an individual known to have cancer and is used for analysis of subsequent samples from the same individual to monitor the progression of the cancer (such as monitoring for remission or reoccurrence of the cancer). In some embodiments, a sample with a high tumor fraction (such as a tumor biopsy or a plasma sample from an individual with a high tumor load) is used to obtain phased data that is used for analysis of subsequent samples with a lower tumor fraction (such as a plasma sample from an individual undergoing treatment for cancer or in remission).
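
The band positions described above for deletions and duplications can be illustrated with a minimal sketch. It assumes that `fraction` is the share of genome equivalents coming from cells that carry the CNV, that there is no measurement noise, and that every cell is otherwise diploid; the function name `band_positions` is hypothetical and chosen only for this illustration.

```python
def band_positions(cnv, fraction):
    """Return (top, bottom) expected A-allele fractions at heterozygous SNPs.

    cnv      -- "deletion" or "duplication" (illustrative labels)
    fraction -- assumed share of genome equivalents carrying the CNV, 0..1
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    normal = 1.0 - fraction  # cells without the CNV contribute one A and one B
    if cnv == "deletion":
        copies, a_top, a_bottom = 1, 1, 0  # affected cells carry A only, or B only
    elif cnv == "duplication":
        copies, a_top, a_bottom = 3, 2, 1  # affected cells carry AAB, or ABB
    else:
        raise ValueError("cnv must be 'deletion' or 'duplication'")
    total = 2 * normal + copies * fraction
    top = (normal + a_top * fraction) / total
    bottom = (normal + a_bottom * fraction) / total
    return top, bottom


# The two bands coincide at 0.5 when no cells carry the CNV and move apart
# as the affected fraction grows.
for f in (0.0, 0.1, 0.5, 1.0):
    print(f, band_positions("deletion", f), band_positions("duplication", f))
```

Under these assumptions the bands lie at 1/(2-f) and (1-f)/(2-f) for a deletion, and at (1+f)/(2+f) and 1/(2+f) for a duplication, where f is the affected fraction.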

In another embodiment for prenatal diagnostics, phased parental haplotypic data is used to detect the presence of more than one homolog from the father, implying that the genetic material from more than one fetus is present in a maternal blood sample. By focusing on chromosomes that are expected to be euploid in a fetus, one could rule out the possibility that the fetus was afflicted with a trisomy. Also, it is possible to determine if the fetal DNA is not from the current father.

In some embodiments, two or more of the methods described herein are used to phase genetic data of an individual. In some embodiments, both a bioinformatics method (such as using population based haplotype frequencies to infer the most likely phase) and a molecular biology method (such as any of the molecular phasing methods disclosed herein to obtain actual phased data rather than bioinformatics-based inferred phased data) are used. In some embodiments, phased data from other subjects (such as prior subjects) is used to refine the population data. For example, phased data from other subjects can be added to population data to calculate priors for possible haplotypes for another subject. In some embodiments, phased data from other subjects (such as prior subjects) is used to calculate priors for possible haplotypes for another subject.

In some embodiments, probabilistic data may be used. For example, due to the probabilistic nature of the representation of DNA molecules in a sample, as well as various amplification and measurement biases, the relative number of molecules of DNA measured from two different loci, or from different alleles at a given locus, is not always representative of the relative number of molecules in the mixture, or in the individual. If one were trying to determine the genotype of a normal diploid individual at a given locus on an autosomal chromosome by sequencing DNA from the plasma of the individual, one would expect to either observe only one allele (homozygous) or about equal numbers of two alleles (heterozygous). If, at that locus, ten molecules of the A allele were observed, and two molecules of the B allele were observed, it would not be clear if the individual was homozygous at the locus, and the two molecules of the B allele were due to noise or contamination, or if the individual was heterozygous, and the lower number of molecules of the B allele was due to random, statistical variation in the number of molecules of DNA in the plasma, amplification bias, contamination or any number of other causes. In this case, a probability that the individual was homozygous, and a corresponding probability that the individual was heterozygous could be calculated, and these probabilistic genotypes could be used in further calculations.
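
The ten-A, two-B example can be made concrete with a small sketch of a probabilistic genotype call. The model is an assumption made for illustration only: reads are treated as independent binomial draws, a homozygous AA individual yields a B read with an assumed noise rate of 2%, a heterozygous individual yields A and B reads with equal probability, and the two genotypes are given equal prior probability. None of these values come from the text.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of observing k successes in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


a_reads, b_reads = 10, 2
n = a_reads + b_reads
noise_rate = 0.02        # assumed chance that a read from an AA individual shows B
prior_hom = prior_het = 0.5  # assumed equal priors

# Likelihood of seeing two B molecules under each genotype hypothesis.
like_hom = binomial_pmf(b_reads, n, noise_rate)
like_het = binomial_pmf(b_reads, n, 0.5)

evidence = like_hom * prior_hom + like_het * prior_het
post_hom = like_hom * prior_hom / evidence
post_het = like_het * prior_het / evidence

print(f"P(homozygous | data)   = {post_hom:.3f}")
print(f"P(heterozygous | data) = {post_het:.3f}")
```

With these assumed parameters neither genotype is strongly favored, which is the situation the paragraph describes; the two posterior values are the probabilistic genotypes that would be carried into later calculations.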

Note that for a given allele ratio, the likelihood that the ratio closely represents the ratio of the DNA molecules in the individual is greater the greater the number of molecules that are observed. For example, if one were to measure 100 molecules of A and 100 molecules of B, the likelihood that the actual ratio was 50% is considerably greater than if one were to measure 10 molecules of A and 10 molecules of B. In one embodiment, one uses Bayesian theory combined with a detailed model of the data to determine the likelihood that a particular hypothesis is correct given an observation. For example, if one were considering two hypotheses (one that corresponds to a trisomic individual and one that corresponds to a disomic individual), then the probability of the disomic hypothesis being correct would be considerably higher for the case where 100 molecules of each of the two alleles were observed, as compared to the case where 10 molecules of each of the two alleles were observed. As the data becomes noisier due to bias, contamination or some other source of noise, or as the number of observations at a given locus goes down, the probability of the maximum likelihood hypothesis being true given the observed data drops. In practice, it is possible to aggregate probabilities over many loci to increase the confidence with which the maximum likelihood hypothesis may be determined to be the correct hypothesis. In some embodiments, the probabilities are simply aggregated without regard for recombination. In some embodiments, the calculations take into account cross-overs.
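
A minimal sketch of the comparison between 10+10 and 100+100 observed molecules follows. It assumes a single heterozygous locus in an unmixed sample, noise-free binomial sampling, equal priors for the disomic and trisomic hypotheses, and a trisomic individual that is equally likely to be AAB or ABB. These are simplifying assumptions for illustration and not the detailed data model referred to in the text.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of observing k A molecules out of n, with A probability p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def posterior_disomy(a_count, b_count):
    """Posterior probability of disomy versus trisomy under the stated assumptions."""
    n = a_count + b_count
    like_disomy = binomial_pmf(a_count, n, 0.5)
    # Trisomy: A fraction is 2/3 (AAB) or 1/3 (ABB), each assumed equally likely.
    like_trisomy = 0.5 * binomial_pmf(a_count, n, 2 / 3) + 0.5 * binomial_pmf(a_count, n, 1 / 3)
    return like_disomy / (like_disomy + like_trisomy)


post_small = posterior_disomy(10, 10)
post_large = posterior_disomy(100, 100)
print(f"10 A + 10 B:   P(disomy | data) = {post_small:.4f}")
print(f"100 A + 100 B: P(disomy | data) = {post_large:.6f}")
```

The same balanced allele ratio supports disomy far more strongly when it rests on 200 molecules than when it rests on 20, which is the point made in the paragraph above.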

In an embodiment, probabilistically phased data is used in the determination of copy number variation. In some embodiments, the probabilistically phased data is population based haplotype block frequency data from a data source such as the HapMap database. In some embodiments, the probabilistically phased data is haplotypic data obtained by a molecular method, for example phasing by dilution where individual segments of chromosomes are diluted to a single molecule per reaction, but where, due to stochastic noise, the identities of the haplotypes may not be absolutely known. In some embodiments, the probabilistically phased data is haplotypic data obtained by a molecular method, where the identities of the haplotypes may be known with a high degree of certainty.

Imagine a hypothetical case where a doctor wanted to determine whether or not an individual had some cells in their body which had a deletion at a particular chromosomal segment by measuring the plasma DNA from the individual. The doctor could make use of the knowledge that if all of the cells from which the plasma DNA originated were diploid, and of the same genotype, then for heterozygous loci, the relative number of molecules of DNA observed for each of the two alleles would fall into one distribution that was centered at 50% A allele and 50% B allele. However, if a fraction of the cells from which the plasma DNA originated had a deletion at a particular chromosome segment, then for heterozygous loci, one would expect that the relative number of molecules of DNA observed for each of the two alleles would fall into two distributions, one centered at above 50% A allele for the loci where there was a deletion of the chromosome segment containing the B allele, and one centered at below 50% for the loci where there was a deletion of the chromosome segment containing the A allele. The greater the proportion of the cells from which the plasma DNA originated that contained the deletion, the further from 50% these two distributions would be.

In this hypothetical case, imagine a clinician who wants to determine if an individual has a deletion of a chromosomal region in a proportion of cells in the individual's body. The clinician may draw blood from the individual into a vacutainer or other type of blood tube, centrifuge the blood, and isolate the plasma layer. The clinician may isolate the DNA from the plasma and enrich the DNA at the targeted loci, possibly through targeted or other amplification, locus capture techniques, size enrichment, or other enrichment techniques. The clinician may analyze the enriched and/or amplified DNA, such as by measuring the number of alleles at a set of SNPs (in other words, generating allele frequency data), using an assay such as qPCR, sequencing, a microarray, or other techniques that measure the quantity of DNA in a sample. We will consider data analysis for the case where the clinician amplified the cell-free plasma DNA using a targeted amplification technique, and then sequenced the amplified DNA to give the following exemplary possible data at six SNPs found on a chromosome segment that is indicative of cancer, where the individual was heterozygous at those SNPs:

SNP 1: 460 reads A allele; 540 reads B allele (46% A)

SNP 2: 530 reads A allele; 470 reads B allele (53% A)

SNP 3: 40 reads A allele; 60 reads B allele (40% A)

SNP 4: 46 reads A allele; 54 reads B allele (46% A)

SNP 5: 520 reads A allele; 480 reads B allele (52% A)

SNP 6: 200 reads A allele; 200 reads B allele (50% A)

From this set of data, it may be difficult to differentiate between the case where the individual is normal, with all cells being disomic, and the case where the individual may have a cancer, with some portion of cells whose DNA contributed towards the cell-free DNA found in the plasma having a deletion or duplication at the chromosome. For example, the two hypotheses with the maximum likelihood may be that the individual has a deletion at this chromosome segment, with a tumor fraction of 6%, and where the deleted segment of the chromosome has the genotype over the six SNPs of (A,B,A,A,B,B) or (A,B,A,A,B,A). In this representation of the individual's genotype over a set of SNPs, the first letter in the parentheses corresponds to the genotype of the haplotype for SNP 1, the second to SNP 2, etc.

If one were to use a method to determine the haplotype of the individual at that chromosome segment, and were to find that the haplotype for one of the two chromosomes was (A,B,A,A,B,B), this would agree with the maximum likelihood hypothesis, and the calculated likelihood that the individual has a deletion at that segment, and therefore may have cancerous or precancerous cells, would be considerably increased. On the other hand, if the individual were found to have the haplotype (A,A,A,A,A,A), then the likelihood that the individual has a deletion at that chromosome segment would be considerably decreased, and perhaps the likelihood of the no-deletion hypothesis would be higher (the actual likelihood values would depend on other parameters such as the measured noise in the system, among others).
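
The effect of haplotype knowledge on the six-SNP example can be sketched as follows. The read counts are those listed above and the 6% tumor fraction is taken from the example; everything else is an assumption made for illustration: reads are modeled as independent binomial draws with no noise term, the tumor fraction is treated as the share of genome equivalents carrying the deletion, and the function name `log_likelihood` is hypothetical.

```python
import math

# (A reads, B reads) for SNP 1 through SNP 6, as listed in the example.
reads = [(460, 540), (530, 470), (40, 60), (46, 54), (520, 480), (200, 200)]


def log_likelihood(reads, deleted_haplotype=None, tumor_fraction=0.0):
    """Binomial log-likelihood of the reads, up to a hypothesis-independent constant.

    deleted_haplotype -- string such as "ABAABB" naming the allele lost at each
                         SNP, or None for the no-deletion hypothesis.
    """
    total = 0.0
    for i, (a, b) in enumerate(reads):
        if deleted_haplotype is None:
            p_a = 0.5
        else:
            # Normal cells contribute one A copy; cells with the deletion
            # contribute an A copy only if the B allele is the one deleted.
            a_in_deleted_cells = 0 if deleted_haplotype[i] == "A" else 1
            a_copies = (1 - tumor_fraction) + tumor_fraction * a_in_deleted_cells
            p_a = a_copies / (2 - tumor_fraction)
        total += a * math.log(p_a) + b * math.log(1 - p_a)
    return total


ll_no_deletion = log_likelihood(reads)
ll_matching = log_likelihood(reads, "ABAABB", 0.06)
# If the individual's haplotypes are (A,A,A,A,A,A) and (B,B,B,B,B,B),
# a deletion must remove one of them in full.
ll_all_a = log_likelihood(reads, "AAAAAA", 0.06)
ll_all_b = log_likelihood(reads, "BBBBBB", 0.06)

print("deletion of (A,B,A,A,B,B) vs. no deletion:", ll_matching - ll_no_deletion)
print("deletion of (A,A,A,A,A,A) vs. no deletion:", ll_all_a - ll_no_deletion)
print("deletion of (B,B,B,B,B,B) vs. no deletion:", ll_all_b - ll_no_deletion)
```

Under these assumptions the deletion hypothesis is favored over no deletion only when the phased haplotype is consistent with the direction of the imbalance at each SNP; with an all-A or all-B haplotype, the no-deletion hypothesis scores higher, in line with the paragraph above.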

There are many ways to determine the haplotype of the individual, many of which are described elsewhere in this document. A partial list is given here, and is not meant to be exhaustive. One method is a biological method where individual DNA molecules are diluted until approximately one molecule from each chromosomal region is in any given reaction volume, and then methods such as sequencing are used to measure the genotype. Another method is informatically based, where population data on various haplotypes coupled with their frequency can be used in a probabilistic manner. Another method is to measure the diploid data of the individual, along with one or a plurality of related individuals who are expected to share haplotype blocks with the individual, and to infer the haplotype blocks. Another method would be to take a sample of tissue with a high concentration of the deleted or duplicated segment, and determine the haplotype based on allelic imbalance; for example, genotype measurements from a sample of tumor tissue with a deletion can be used to determine the phased data for that deletion region, and this data can then be used to determine if the cancer has regrown post-resection.

In practice, typically more than 20 SNPs, more than 50 SNPs, more than 100 SNPs, more than 500 SNPs, more than 1,000 SNPs, or more than 5,000 SNPs are measured on a given chromosome segment.

Exemplary Methods for Phasing, Predicting Allele Ratios, and Reconstructing Fetal Genetic Data

In one aspect, the invention features methods for determining one or more haplotypes of a fetus. In various embodiments, this method allows one to determine which polymorphic loci (such as SNPs) were inherited by the fetus and to reconstruct which homologs (including recombination events) are present in the fetus (and thereby interpolate the sequence between the polymorphic loci). If desired, essentially the entire genome of the fetus can be reconstructed. If there is some remaining ambiguity in the genome of the fetus (such as in intervals with a crossover), this ambiguity can be minimized if desired by analyzing additional polymorphic loci. In various embodiments, the polymorphic loci are chosen to cover one or more of the chromosomes at a density to reduce any ambiguity to a desired level. This method has important applications for the detection of polymorphisms or other mutations of interest (such as deletions or duplications) in a fetus since it enables their detection based on linkage (such as the presence of linked polymorphic loci in the fetal genome) rather than by directly detecting the polymorphism or other mutation of interest in the fetal genome. For example, if a parent is a carrier for a mutation associated with cystic fibrosis (CF), a nucleic acid sample that includes maternal DNA from the mother of the fetus and fetal DNA from the fetus can be analyzed to determine whether the fetal DNA includes the haplotype containing the CF mutation. In particular, polymorphic loci can be analyzed to determine whether the fetal DNA includes the haplotype containing the CF mutation without having to detect the CF mutation itself in the fetal DNA. This is useful in screening for one or more mutations, such as disease-linked mutations, without having to directly detect the mutations.

In some embodiments, the method involves determining a parental haplotype (e.g., a haplotype of the mother or father of the fetus), such as by using any of the methods described herein. In some embodiments, this determination is made without using data from a relative of the mother or father. In some embodiments, a parental haplotype is determined using a dilution approach followed by SNP genotyping or sequencing as described herein. In some embodiments, a haplotype of the mother (or father) is determined by any of the methods described herein using data from a relative of the mother (or father). In some embodiments, a haplotype is determined for both the father and the mother.

This parental haplotype data can be used to determine if the fetus inherited the parental haplotype. In some embodiments, a nucleic acid sample that includes maternal DNA from the mother of the fetus and fetal DNA from the fetus is analyzed using a SNP array to detect at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different polymorphic loci. In some embodiments, a nucleic acid sample that includes maternal DNA from the mother of the fetus and fetal DNA from the fetus is analyzed by contacting the sample with a library of primers that simultaneously hybridize to at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different polymorphic loci (such as SNPs) to produce a reaction mixture. In some embodiments, the reaction mixture is subjected to primer extension reaction conditions to produce amplified products. In some embodiments, the amplified products are measured with a high throughput sequencer to produce sequencing data.

In various embodiments, a fetal haplotype is determined using data about the probability of chromosomes crossing over at different locations in a chromosome or chromosome segment (such as by using recombination data such as may be found in the HapMap database to create a recombination risk score for any interval) to model dependence between polymorphic alleles on the chromosome or chromosome segment as described above. In some embodiments, the method takes into account physical distance of the SNPs (such as SNPs flanking a gene or mutation of interest) and recombination data from location specific recombination likelihoods and the data observed from the genetic measurements of the maternal plasma to obtain the most likely genotype of the fetus. Then PARENTAL SUPPORT™ may be performed on the targeted sequencing or SNP array data obtained from these SNPs to determine which homologs were inherited by the fetus from both parents (see, e.g., U.S. application Ser. No. 11/603,406 (US Publication No. 20070184467), U.S. application Ser. No. 12/076,348 (US Publication No. 20080243398), U.S. application Ser. No. 13/110,685 (U.S. Publication No. 2011/0288780), PCT Application PCT/US09/52730 (PCT Publication No. WO/2010/017214), PCT Application No. PCT/US10/050824 (PCT Publication No. WO/2011/041485), U.S. application Ser. No. 13/300,235 (U.S. Publication No. 2012/0270212), U.S. application Ser. No. 13/335,043 (U.S. Publication No. 2012/0122701), U.S. application Ser. No. 13/683,604, and U.S. application Ser. No. 13/780,022, which are each hereby incorporated by reference in its entirety).

Assume a generalized example where the possible alleles at one locus are A and B; assignment of the identity A or B to particular alleles is arbitrary. Parental genotypes for a particular SNP, termed genetic contexts, are expressed as maternal genotype|paternal genotype. Thus, if the mother is homozygous and the father is heterozygous, this would be represented as AA|AB. Similarly, if both parents are homozygous for the same allele, the parental genotypes would be represented as AA|AA. In the AA|AA context, the fetus would never have AB or BB states, so the number of sequence reads with the B allele will be low and can thus be used to determine the noise responses of the assay and genotyping platform, including effects such as low level DNA contamination and sequencing errors; these noise responses are useful for modeling expected genetic data profiles. There are only five possible maternal|paternal genetic contexts: AA|AA, AA|AB, AB|AA, AB|AB, and AA|BB; other contexts are equivalent by symmetry. SNPs where the parents are homozygous for the same allele are only informative for determining noise and contamination levels. SNPs where the parents are not homozygous for the same allele are informative in determining fetal fraction and copy number count.
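
As a small illustration of the five genetic contexts, the sketch below enumerates the fetal genotypes that are possible in each context. It assumes a disomic fetus that inherits exactly one allele from each parent and ignores de novo mutation; the helper name `fetal_genotypes` is hypothetical.

```python
from itertools import product


def fetal_genotypes(maternal, paternal):
    """Possible disomic fetal genotypes, taking one allele from each parent."""
    return {"".join(sorted(m + p)) for m, p in product(maternal, paternal)}


# The five maternal|paternal contexts named in the text.
contexts = ["AA|AA", "AA|AB", "AB|AA", "AB|AB", "AA|BB"]
possible = {c: fetal_genotypes(*c.split("|")) for c in contexts}

for context in contexts:
    print(context, "->", sorted(possible[context]))
```

In the AA|AA context the only possible fetal genotype is AA, so any B reads there reflect noise or contamination, while the remaining contexts allow at least one genotype that carries a B allele.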

Let N_(A,i) and N_(B,i) represent the number of reads of each allele at SNP i, and let C_(i) represent the parental genetic context at that locus. The data set for a particular chromosome is represented by N_(AB)={N_(A,i),N_(B,i)}, i=1 . . . N and C={C_(i)}, i=1 . . . N. For reconstructing part or all of the fetal genome, it can optionally be determined if the fetus has an aneuploidy (such as a missing or extra copy of a chromosome or chromosome segment). For each individual chromosome or chromosome segment under study, let H represent the set of one or more hypotheses for the total number of chromosomes, the parental origin of each chromosome, and the positions on the parent chromosomes where recombination occurred during formation of the gametes that fertilized to create the child. The probability of a hypothesis P(H) can be computed using the data from the HapMap database and prior information related to each of the ploidy states.

Furthermore, let F represent the fetal cfDNA fraction in the sample. Given a set of possible H, C, and F, one can compute the probability of N_(AB), P(N_(AB)|H,F,C), based on modeling the noise sources of the molecular assay and the sequencing platform. The goal is to find the hypothesis H and the fetal fraction F that maximize P(H,F|N_(AB)). Using standard Bayesian statistical techniques, and assuming a uniform probability distribution for F from 0 to 1, this can be recast in terms of maximizing P(N_(AB)|H,F,C)P(H) over H and F, all of which can now be computed. The probabilities of all hypotheses associated with a particular copy number and fetal fraction, e.g., trisomy and F=10%, but covering all possible parental chromosome origins and crossover locations, are summed. The copy number hypothesis with the highest probability is selected as the test result, the fetal fraction associated with that hypothesis reveals the fetal fraction, and the probability associated with that hypothesis is the calculated accuracy of the result.
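
The Bayesian recasting described above can be written out as follows. This is a sketch under two assumptions not stated in the text: that H and F are independent a priori, and that neither prior depends on the parental contexts C. The symbols c (a copy number) and \mathcal{H}_c (the set of hypotheses with copy number c, over all parental origins and crossover locations) are introduced here only for illustration.

```latex
\begin{align*}
P(H, F \mid N_{AB}, C)
  &= \frac{P(N_{AB} \mid H, F, C)\, P(H)\, P(F)}{P(N_{AB} \mid C)} \\
  &\propto P(N_{AB} \mid H, F, C)\, P(H),
  \qquad F \sim \mathrm{Uniform}(0, 1), \\
P(c, F \mid N_{AB}, C)
  &\propto \sum_{H \in \mathcal{H}_c} P(N_{AB} \mid H, F, C)\, P(H).
\end{align*}
```

Because P(F) is constant on the interval from 0 to 1 and the denominator does not depend on H or F, maximizing the posterior is equivalent to maximizing the product of the data likelihood and the hypothesis prior.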

In some embodiments, the algorithm uses in silico simulations to generate a very large number of hypothetical sequencing data sets that could result from the possible fetal genetic inheritance patterns, sample parameters, and amplification and measurement artifacts of the method. More specifically, the algorithm first utilizes parental genotypes at a large number of SNPs and crossover frequency data from the HapMap database to predict possible fetal genotypes. It then predicts expected data profiles for the sequencing data that would be measured from mixed samples originating from a mother carrying a fetus with each of the possible fetal genotypes and taking into account a variety of parameters including fetal fraction, expected read depth profile, fetal genome equivalents present in the sample, expected amplification bias at each of the SNPs, and a number of noise parameters. A data model describes how the sequencing or SNP array data is expected to appear for each of these hypotheses given the particular parameter set. The hypothesis with the best data fit between this modeled data and the measured data is selected.

If desired, expected allele ratios can be calculated for DNA or RNA from the fetus using the results of what haplotypes were inherited by the fetus. The expected allele ratios can also be calculated for a mixed sample containing nucleic acids from both the mother and the fetus (these allele ratios indicate what is expected for measurement of the total amount of each allele, including the amount of the allele from both maternal nucleic acids and fetal nucleic acids in the sample). The expected allele ratios can be calculated for different hypotheses specifying the degree of overrepresentation of the first homologous chromosome segment.

In some embodiments, the method involves determining whether the fetus has one or more of the following conditions: cystic fibrosis, Huntington's disease, Fragile X, thalassemia, muscular dystrophy (such as Duchenne's muscular dystrophy), Alzheimer's disease, Fanconi Anemia, Gaucher Disease, Mucolipidosis IV, Niemann-Pick Disease, Tay-Sachs disease, Sickle cell anemia, Parkinson's disease, Torsion Dystonia, and cancer. In some embodiments, a fetal haplotype is determined for one or more chromosomes taken from the group consisting of chromosomes 13, 18, 21, X, and Y. In some embodiments, a fetal haplotype is determined for all of the fetal chromosomes. In various embodiments, the method determines essentially the entire genome of the fetus. In some embodiments, the haplotype is determined for at least 30, 40, 50, 60, 70, 80, 90, or 95% of the genome of the fetus. In some embodiments, the haplotype determination of the fetus includes information about which allele is present for at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different polymorphic loci. In some embodiments, this method is used to determine a haplotype or allele ratios for an embryo.

Exemplary Methods for Predicting Allele Ratios

Exemplary methods are described below for calculating expected allele ratios for a sample. Table 1 shows expected allele ratios for a mixed sample (such as a maternal blood sample) containing nucleic acids from both the mother and the fetus. These expected allele ratios indicate what is expected for measurement of the total amount of each allele, including the amount of the allele from both maternal nucleic acids and fetal nucleic acids in the mixed sample. In an example, the mother is heterozygous at two neighboring loci that are expected to cosegregate (e.g., two loci for which no chromosome crossovers are expected between the loci). Thus, the mother is (AB, AB). Now imagine that the phased data for the mother indicates that for one haplotype she is (A, A); thus, for the other haplotype one can infer that she is (B, B). Table 1 gives the expected allele ratios for different hypotheses where the fetal fraction is 20%. For this example, no knowledge of the paternal data is assumed, and the heterozygosity rate is assumed to be 50%. The expected allele ratios are given in terms of (expected proportion of A reads/total number of reads) for each of the two SNPs. These ratios are calculated both using maternal phased data (the knowledge that one haplotype is (A, A) and one is (B, B)) and without using the maternal phased data. Table 1 includes different hypotheses for the number of copies of the chromosome segment in the fetus from each parent.

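TABLE 1 Expected Genetic Data for Mixed Sample of Maternal and Fetal Nucleic Acids

For each copy number hypothesis, the expected allele ratios are listed first when using maternal phased data and then when not using maternal phased data.

Copy Number Hypothesis: Monosomy (maternal copy missing)
Expected allele ratios when using maternal phased data: (0.444; 0.444), (0.444; 0.555), (0.555; 0.444), (0.555; 0.555)
Expected allele ratios when not using maternal phased data: (0.444; 0.444), (0.444; 0.555), (0.555; 0.444), (0.555; 0.555)

Copy Number Hypothesis: Monosomy (paternal copy missing)
Expected allele ratios when using maternal phased data: (0.444; 0.444), (0.555; 0.555)
Expected allele ratios when not using maternal phased data: (0.444; 0.444), (0.444; 0.555), (0.555; 0.444), (0.555; 0.555)

Copy Number Hypothesis: Disomy
Expected allele ratios when using maternal phased data: (0.40; 0.40), (0.40; 0.50), (0.50; 0.40), (0.50; 0.50), (0.50; 0.60), (0.60; 0.50), (0.60; 0.60)
Expected allele ratios when not using maternal phased data: (0.40; 0.40), (0.40; 0.50), (0.40; 0.60), (0.50; 0.40), (0.50; 0.50), (0.50; 0.60), (0.60; 0.40), (0.60; 0.50), (0.60; 0.60)

Copy Number Hypothesis: Trisomy (extra matching maternal copy)
Expected allele ratios when using maternal phased data: (0.36; 0.36), (0.36; 0.45), (0.45; 0.36), (0.45; 0.45), (0.54; 0.54), (0.54; 0.63), (0.63; 0.54), (0.63; 0.63)
Expected allele ratios when not using maternal phased data: (0.36; 0.36), (0.36; 0.45), (0.36; 0.54), (0.36; 0.63), (0.45; 0.36), (0.45; 0.45), (0.45; 0.54), (0.45; 0.63), (0.54; 0.36), (0.54; 0.45), (0.54; 0.54), (0.54; 0.63), (0.63; 0.36), (0.63; 0.45), (0.63; 0.54), (0.63; 0.63)

Copy Number Hypothesis: Trisomy (extra unmatching maternal copy)
Expected allele ratios when using maternal phased data: (0.45; 0.45), (0.45; 0.54), (0.54; 0.45), (0.54; 0.54)
Expected allele ratios when not using maternal phased data: (0.36; 0.36), (0.36; 0.45), (0.36; 0.54), (0.36; 0.63), (0.45; 0.36), (0.45; 0.45), (0.45; 0.54), (0.45; 0.63), (0.54; 0.36), (0.54; 0.45), (0.54; 0.54), (0.54; 0.63), (0.63; 0.36), (0.63; 0.45), (0.63; 0.54), (0.63; 0.63)

Copy Number Hypothesis: Trisomy (extra matching paternal copy)
Expected allele ratios when using maternal phased data: (0.36; 0.36), (0.36; 0.54), (0.54; 0.36), (0.54; 0.54), (0.45; 0.45), (0.45; 0.63), (0.63; 0.45), (0.63; 0.63)
Expected allele ratios when not using maternal phased data: (0.36; 0.36), (0.36; 0.45), (0.36; 0.54), (0.36; 0.63), (0.45; 0.36), (0.45; 0.45), (0.45; 0.54), (0.45; 0.63), (0.54; 0.36), (0.54; 0.45), (0.54; 0.54), (0.54; 0.63), (0.63; 0.36), (0.63; 0.45), (0.63; 0.54), (0.63; 0.63)

Copy Number Hypothesis: Trisomy (extra unmatching paternal copy)
Expected allele ratios when using maternal phased data: (0.36; 0.36), (0.36; 0.45), (0.36; 0.54), (0.36; 0.63), (0.45; 0.36), (0.45; 0.45), (0.45; 0.54), (0.45; 0.63), (0.54; 0.36), (0.54; 0.45), (0.54; 0.54), (0.54; 0.63), (0.63; 0.36), (0.63; 0.45), (0.63; 0.54), (0.63; 0.63)
Expected allele ratios when not using maternal phased data: (0.36; 0.36), (0.36; 0.45), (0.36; 0.54), (0.36; 0.63), (0.45; 0.36), (0.45; 0.45), (0.45; 0.54), (0.45; 0.63), (0.54; 0.36), (0.54; 0.45), (0.54; 0.54), (0.54; 0.63), (0.63; 0.36), (0.63; 0.45), (0.63; 0.54), (0.63; 0.63)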
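
As an illustration only, the disomy entries of Table 1 can be reproduced with a short sketch. The mixing formula below is an assumption that is not stated in the text: the expected A-read proportion is taken to be the fetal-fraction-weighted A copies divided by the fetal-fraction-weighted total copies, with a disomic mother. The function names are hypothetical. With a fetal fraction of 20% this formula gives 0.40, 0.50, and 0.60 for disomy, and approximately 0.444 for monosomy and 0.364 for trisomy; some table values (such as 0.555 and 0.63) appear to be truncated rather than rounded.

```python
from itertools import product


def expected_a_ratio(fetal_fraction, maternal_a, fetal_a, fetal_copies):
    """Expected proportion of A reads at one SNP in a maternal/fetal mixture.

    Assumption (not from the text): reads are drawn in proportion to the
    number of chromosome copies each genome contributes, and the mother
    carries two copies of the segment.
    """
    a_amount = (1 - fetal_fraction) * maternal_a + fetal_fraction * fetal_a
    total = (1 - fetal_fraction) * 2 + fetal_fraction * fetal_copies
    return a_amount / total


def disomy_ratio_pairs(fetal_fraction, use_maternal_phase):
    """Enumerate expected (SNP1; SNP2) ratio pairs for a disomic fetus.

    The mother is heterozygous (AB) at both SNPs. Haplotypes are encoded as
    tuples of A counts, so (1, 1) is the (A, A) haplotype.
    """
    if use_maternal_phase:
        # Phased: the maternal haplotypes are known to be (A, A) and (B, B).
        maternal_haps = [(1, 1), (0, 0)]
    else:
        # Unphased: any combination of one maternal allele per SNP.
        maternal_haps = list(product((0, 1), repeat=2))
    # No paternal knowledge: the paternal haplotype may be anything.
    paternal_haps = list(product((0, 1), repeat=2))

    pairs = set()
    for mat, pat in product(maternal_haps, paternal_haps):
        pair = tuple(
            round(expected_a_ratio(fetal_fraction, 1, mat[i] + pat[i], 2), 2)
            for i in range(2)
        )
        pairs.add(pair)
    return sorted(pairs)


phased = disomy_ratio_pairs(0.2, use_maternal_phase=True)
unphased = disomy_ratio_pairs(0.2, use_maternal_phase=False)
print(len(phased), len(unphased))
```

Under these assumptions the phased enumeration omits (0.40; 0.60) and (0.60; 0.40), which agrees with the disomy rows of Table 1.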

In addition to the fact that using phased data reduces the number of possible expected allele ratios, it also changes the prior likelihood of each of the expected allele ratios, such that the maximum likelihood result is more likely to be correct. Eliminating expected allele ratios or hypotheses that are not possible increases the likelihood that the correct hypothesis will be chosen. As an example, suppose the measured allele ratios are (0.41, 0.59). Without using phased data, one might assume that the hypothesis with maximum likelihood is a disomy hypothesis (given the similarity of the measured allele ratios to expected allele ratios of (0.40, 0.60) for disomy). However, using phased data, one can exclude (0.40, 0.60) as expected allele ratios for the disomy hypothesis, and one can select a trisomy hypothesis as more likely.

Assume the measured allele ratios are (0.4, 0.4). Without any haplotype information, the probability of a maternal deletion at each SNP would be 0.5×P(A deleted)+0.5×P(B deleted). Therefore, although it looks like A is deleted (missing in the fetus), the likelihood of deletion would be the average of the two. For high enough fetal fraction, one can still determine the most likely hypothesis. For low enough fetal fraction, averaging may work in disfavor of the deletion hypothesis. However, with haplotype information, the probability of homolog 1 being deleted, P(A deleted), is greater and will fit the measured data better. If desired, crossover probabilities between the two loci can also be considered.

In a further illustrative example of combining likelihoods using phased data, consider two consecutive SNPs s1 and s2, and let D1 and D2 denote the allele data at these SNPs. Here we provide an example of how to combine the likelihoods for these two SNPs. Let c denote the probability that two consecutive heterozygous SNPs have the same allele in the same homolog (i.e., both SNPs are AB or both SNPs are BA). Hence 1-c is the probability that one SNP is AB and the other one is BA. For example, consider the hypothesis H10 and allelic imbalance value f. First, assume that all likelihoods are computed assuming that all SNPs are either AB or BA. Then, we can combine the likelihoods in two consecutive SNPs as follows:

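\[
\mathrm{Lik}(D_1, D_2 \mid H_{10}, f) = \mathrm{Lik}(D_1 \mid H_{10}, f) \times c \times \mathrm{Lik}(D_2 \mid H_{10}, f) + \mathrm{Lik}(D_1 \mid H_{10}, f) \times (1 - c) \times \mathrm{Lik}(D_2 \mid H_{01}, f).
\]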

We can do this recursively to determine the joint likelihood $\mathrm{Lik}(D_1, \ldots, D_N \mid H_{10}, f)$ for all SNPs.
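
One way to read "recursively" is as a forward pass over a two-state chain, where each SNP is in either the H10 or the H01 orientation and consecutive SNPs keep the same orientation with probability c. The sketch below follows that reading; it is an interpretation rather than a procedure spelled out in the text, and the function and argument names are hypothetical. The per-SNP likelihoods are assumed to have been computed already for the given allelic imbalance value f.

```python
def joint_likelihood(lik_h10, lik_h01, c):
    """Combine per-SNP likelihoods into a joint likelihood under H10.

    lik_h10[k] and lik_h01[k] are the likelihoods of the data at SNP k under
    the H10 and H01 orientations. c is the probability that two consecutive
    heterozygous SNPs have the same allele on the same homolog.
    The first SNP is fixed in the H10 orientation, as in the two-SNP
    formula, so lik_h01[0] is not used.
    """
    if not lik_h10 or len(lik_h10) != len(lik_h01):
        raise ValueError("likelihood lists must be non-empty and equal length")
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must be a probability")

    # Forward variables: joint likelihood of the data so far, ending in each
    # orientation.
    same, flipped = lik_h10[0], 0.0
    for l10, l01 in zip(lik_h10[1:], lik_h01[1:]):
        same, flipped = (
            l10 * (c * same + (1 - c) * flipped),
            l01 * (c * flipped + (1 - c) * same),
        )
    return same + flipped


# Two SNPs reproduce the two-term formula:
# 0.2 * 0.9 * 0.3 + 0.2 * 0.1 * 0.4
two_snp = joint_likelihood([0.2, 0.3], [0.1, 0.4], c=0.9)
print(two_snp)
```

For long runs of SNPs the products become very small, so a practical implementation would likely work in log space or rescale the forward variables at each step.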

Exemplary Mutations

Exemplary mutations associated with a disease or disorder such as cancer or an increased risk (such as an above normal level of risk) for a disease or disorder such as cancer include single nucleotide variants (SNVs), multiple nucleotide mutations, deletions (such as deletion of a 2 to 30 million base pair region), duplications, or tandem repeats. In some embodiments, the mutation is in DNA, such as cfDNA, cell-free mitochondrial DNA (cf mDNA), cell-free DNA that originated from nuclear DNA (cf nDNA), cellular DNA, or mitochondrial DNA. In some embodiments, the mutation is in RNA, such as cfRNA, cellular RNA, cytoplasmic RNA, coding cytoplasmic RNA, non-coding cytoplasmic RNA, mRNA, miRNA, mitochondrial RNA, rRNA, or tRNA. In some embodiments, the mutation is present at a higher frequency in subjects with a disease or disorder (such as cancer) than subjects without the disease or disorder (such as cancer). In some embodiments, the mutation is indicative of cancer, such as a causative mutation. In some embodiments, the mutation is a driver mutation that has a causative role in the disease or disorder. In some embodiments, the mutation is not a causative mutation. For example, in some cancers, multiple mutations accumulate but some of them are not causative mutations. Mutations (such as those that are present at a higher frequency in subjects with a disease or disorder than subjects without the disease or disorder) that are not causative can still be useful for diagnosing the disease or disorder. In some embodiments, the mutation is loss-of-heterozygosity (LOH) at one or more microsatellites.

In some embodiments, a subject is screened for one or more polymorphisms or mutations that the subject is known to have (e.g., to test for their presence, a change in the amount of cells, DNA, or RNA with these polymorphisms or mutations, or cancer remission or re-occurrence). In some embodiments, a subject is screened for one or more polymorphisms or mutations that the subject is known to be at risk for (such as a subject who has a relative with the polymorphism or mutation). In some embodiments, a subject is screened for a panel of polymorphisms or mutations associated with a disease or disorder such as cancer (e.g., at least 5, 10, 50, 100, 200, 300, 500, 750, 1,000, 1,500, 2,000, or 5,000 polymorphisms or mutations).

Many coding variants associated with cancer are described in Abaan et al., “The Exomes of the NCI-60 Panel: A Genomic Resource for Cancer Biology and Systems Pharmacology”, Cancer Research, Jul. 15, 2013, and on the world wide web at dtp.nci.nih.gov/branches/btb/characterizationNCI60.html (which are each hereby incorporated by reference in its entirety). The NCI-60 human cancer cell line panel consists of 60 different cell lines representing cancers of the lung, colon, brain, ovary, breast, prostate, and kidney, as well as leukemia and melanoma. The genetic variations that were identified in these cell lines consisted of two types: type I variants that are found in the normal population, and type II variants that are cancer-specific.

Exemplary polymorphisms or mutations (such as deletions or duplications) are in one or more of the following genes: TP53, PTEN, PIK3CA, APC, EGFR, NRAS, NF2, FBXW7, ERBBs, ATAD5, KRAS, BRAF, VEGF, EGFR, HER2, ALK, p53, BRCA, BRCA1, BRCA2, SETD2, LRP1B, PBRM, SPTA1, DNMT3A, ARID1A, GRIN2A, TRRAP, STAG2, EPHA3/5/7, POLE, SYNE1, C20orf80, CSMD1, CTNNB1, ERBB2, FBXW7, KIT, MUC4, ATM, CDH1, DDX11, DDX12, DSPP, EPPK1, FAM186A, GNAS, HRNR, KRTAP4-11, MAP2K4, MLL3, NRAS, RB1, SMAD4, TTN, ABCC9, ACVR1B, ADAM29, ADAMTS19, AGAP10, AKT1, AMBN, AMPD2, ANKRD30A, ANKRD40, APOBR, AR, BIRC6, BMP2, BRAT1, BTNL8, C12orf4, C1QTNF7, C20orf186, CAPRIN2, CBWD1, CCDC30, CCDC93, CD5L, CDC27, CDC42BPA, CDH9, CDKN2A, CHD8, CHEK2, CHRNA9, CIZ1, CLSPN, CNTN6, COL14A1, CREBBP, CROCC, CTSF, CYP1A2, DCLK1, DHDDS, DHX32, DKK2, DLEC1, DNAH14, DNAH5, DNAH9, DNASE1L3, DUSP16, DYNC2H1, ECT2, EFHB, RRN3P2, TRIM49B, TUBB8P5, EPHA7, ERBB3, ERCC6, FAM21A, FAM21C, FCGBP, FGFR2, FLG2, FLT1, FOLR2, FRYL, FSCB, GAB1, GABRA4, GABRP, GH2, GOLGA6L1, GPHB5, GPR32, GPX5, GTF3C3, HECW1, HIST1H3B, HLA-A, HRAS, HS3ST1, HS6ST1, HSPD1, IDH1, JAK2, KDM5B, KIAA0528, KRT15, KRT38, KRTAP21-1, KRTAP4-5, KRTAP4-7, KRTAP5-4, KRTAP5-5, LAMA4, LATS1, LMF1, LPAR4, LPPR4, LRRFIP1, LUM, LYST, MAP2K1, MARCH1, MARCO, MB21D2, MEGF10, MMP16, MORC1, MRE11A, MTMR3, MUC12, MUC17, MUC2, MUC20, NBPF10, NBPF20, NEK1, NFE2L2, NLRP4, NOTCH2, NRK, NUP93, OBSCN, OR11H1, OR2B11, OR2M4, OR4Q3, OR5D13, OR812, OXSM, PIK3R1, PPP2R5C, PRAME, PRF1, PRG4, PRPF19, PTH2, PTPRC, PTPRJ, RAC1, RAD50, RBM12, RGPD3, RGS22, ROR1, RP11-671M22.1, RP13-996F3.4, RP1L1, RSBN1L, RYR3, SAMD3, SCN3A, SEC31A, SF1, SF3B1, SLC25A2, SLC44A1, SLC4A11, SMAD2, SPTA1, ST6GAL2, STK11, SZT2, TAF1L, TAX1BP1, TBP, TGFBI, TIF1, TMEM14B, TMEM74, TPTE, TRAPPC8, TRPS1, TXNDC6, USP32, UTP20, VASN, VPS72, WASH3P, WWTR1, XPO1, ZFHX4, ZMIZ1, ZNF167, ZNF436, ZNF492, ZNF598, ZRSR2, ABL1, AKT2, AKT3, ARAF, ARFRP1, ARID2, ASXL1, ATR, ATRX, AURKA, AURKB, AXL, BAP1, BARD1, BCL2, BCL2L2, BCL6, BCOR, BCORL1, BLM, BRIP1, 
BTK, CARD11, CBFB, CBL, CCND1, CCND2, CCND3, CCNE1, CD79A, CD79B, CDC73, CDK12, CDK4, CDK6, CDK8, CDKN1B, CDKN2B, CDKN2C, CEBPA, CHEK1, CIC, CRKL, CRLF2, CSF1R, CTCF, CTNNA1, DAXX, DDR2, DOT1L, EMSY (C11orf30), EP300, EPHA3, EPHA5, EPHB1, ERBB4, ERG, ESR1, EZH2, FAM123B (WTX), FAM46C, FANCA, FANCC, FANCD2, FANCE, FANCF, FANCG, FANCL, FGF10, FGF14, FGF19, FGF23, FGF3, FGF4, FGF6, FGFR1, FGFR2, FGFR3, FGFR4, FLT3, FLT4, FOXL2, GATA1, GATA2, GATA3, GID4 (C17orf39), GNA11, GNA13, GNAQ, GNAS, GPR124, GSK3B, HGF, IDH1, IDH2, IGF1R, IKBKE, IKZF1, IL7R, INHBA, IRF4, IRS2, JAK1, JAK3, JUN, KAT6A (MYST3), KDM5A, KDM5C, KDM6A, KDR, KEAP1, KLHL6, MAP2K2, MAP2K4, MAP3K1, MCL1, MDM2, MDM4, MED12, MEF2B, MEN1, MET, MITF, MLH1, MLL, MLL2, MPL, MSH2, MSH6, MTOR, MUTYH, MYC, MYCL1, MYCN, MYD88, NF1, NFKBIA, NKX2-1, NOTCH1, NPM1, NRAS, NTRK1, NTRK2, NTRK3, PAK3, PALB2, PAX5, PBRM1, PDGFRA, PDGFRB, PDK1, PIK3CG, PIK3R2, PPP2R1A, PRDM1, PRKAR1A, PRKDC, PTCH1, PTPN11, RAD51, RAF1, RARA, RET, RICTOR, RNF43, RPTOR, RUNX1, SMARCA4, SMARCB1, SMO, SOCS1, SOX10, SOX2, SPEN, SPOP, SRC, STAT4, SUFU, TET2, TGFBR2, TNFAIP3, TNFRSF14, TOP1, TP53, TSC1, TSC2, TSHR, VHL, WISP3, WT1, ZNF217, ZNF703, and combinations thereof (Su et al., J Mol Diagn 2011, 13:74-84; DOI:10.1016/j.jmoldx.2010.11.010; and Abaan et al., “The Exomes of the NCI-60 Panel: A Genomic Resource for Cancer Biology and Systems Pharmacology”, Cancer Research, Jul. 15, 2013, which are each hereby incorporated by reference in its entirety). In some embodiments, the duplication is a chromosome 1p (“Chr1p”) duplication associated with breast cancer. In some embodiments, one or more polymorphisms or mutations are in BRAF, such as the V600E mutation. In some embodiments, one or more polymorphisms or mutations are in K-ras. In some embodiments, there is a combination of one or more polymorphisms or mutations in K-ras and APC. In some embodiments, there is a combination of one or more polymorphisms or mutations in K-ras and p53. 
In some embodiments, there is a combination of one or more polymorphisms or mutations in APC and p53. In some embodiments, there is a combination of one or more polymorphisms or mutations in K-ras, APC, and p53. In some embodiments, there is a combination of one or more polymorphisms or mutations in K-ras and EGFR. Exemplary polymorphisms or mutations are in one or more of the following microRNAs: miR-15a, miR-16-1, miR-23a, miR-23b, miR-24-1, miR-24-2, miR-27a, miR-27b, miR-29b-2, miR-29c, miR-146, miR-155, miR-221, miR-222, and miR-223 (Calin et al. “A microRNA signature associated with prognosis and progression in chronic lymphocytic leukemia.” N Engl J Med 353:1793-801, 2005, which is hereby incorporated by reference in its entirety).

In some embodiments, the deletion is a deletion of at least 0.01 kb, 0.1 kb, 1 kb, 10 kb, 100 kb, 1 mb, 2 mb, 3 mb, 5 mb, 10 mb, 15 mb, 20 mb, 30 mb, or 40 mb. In some embodiments, the deletion is a deletion of between 1 kb to 40 mb, such as between 1 kb to 100 kb, 100 kb to 1 mb, 1 to 5 mb, 5 to 10 mb, 10 to 15 mb, 15 to 20 mb, 20 to 25 mb, 25 to 30 mb, or 30 to 40 mb, inclusive.

In some embodiments, the duplication is a duplication of at least 0.01 kb, 0.1 kb, 1 kb, 10 kb, 100 kb, 1 mb, 2 mb, 3 mb, 5 mb, 10 mb, 15 mb, 20 mb, 30 mb, or 40 mb. In some embodiments, the duplication is a duplication of between 1 kb to 40 mb, such as between 1 kb to 100 kb, 100 kb to 1 mb, 1 to 5 mb, 5 to 10 mb, 10 to 15 mb, 15 to 20 mb, 20 to 25 mb, 25 to 30 mb, or 30 to 40 mb, inclusive.

In some embodiments, the tandem repeat is a repeat of between 2 and 60 nucleotides, such as 2 to 6, 7 to 10, 10 to 20, 20 to 30, 30 to 40, 40 to 50, or 50 to 60 nucleotides, inclusive. In some embodiments, the tandem repeat is a repeat of 2 nucleotides (dinucleotide repeat). In some embodiments, the tandem repeat is a repeat of 3 nucleotides (trinucleotide repeat).

In some embodiments, the polymorphism or mutation is prognostic. Exemplary prognostic mutations include K-ras mutations, such as K-ras mutations that are indicators of post-operative disease recurrence in colorectal cancer (Ryan et al. “A prospective study of circulating mutant KRAS2 in the serum of patients with colorectal neoplasia: strong prognostic indicator in postoperative follow up,” Gut 52:101-108, 2003; and Lecomte T et al. “Detection of free-circulating tumor-associated DNA in plasma of colorectal cancer patients and its association with prognosis,” Int J Cancer 100:542-548, 2002, which are each hereby incorporated by reference in its entirety).

In some embodiments, the polymorphism or mutation is associated with altered response to a particular treatment (such as increased or decreased efficacy or side-effects). Examples include K-ras mutations, which are associated with decreased response to EGFR-based treatments in non-small cell lung cancer (Wang et al. “Potential clinical significance of a plasma-based KRAS mutation analysis in patients with advanced non-small cell lung cancer,” Clin Canc Res 16:1324-1330, 2010, which is hereby incorporated by reference in its entirety).

K-ras is an oncogene that is activated in many cancers. Exemplary K-ras mutations are mutations in codons 12, 13, and 61. K-ras cfDNA mutations have been identified in pancreatic, lung, colorectal, bladder, and gastric cancers (Fleischhacker & Schmidt “Circulating nucleic acids (CNAs) and cancer—a survey,” Biochim Biophys Acta 1775:181-232, 2007, which is hereby incorporated by reference in its entirety).

p53 is a tumor suppressor that is mutated in many cancers and contributes to tumor progression (Levine & Oren “The first 30 years of p53: growing ever more complex,” Nature Rev Cancer 9:749-758, 2009, which is hereby incorporated by reference in its entirety). Many different codons can be mutated, such as Ser249. p53 cfDNA mutations have been identified in breast, lung, ovarian, bladder, gastric, pancreatic, colorectal, bowel, and hepatocellular cancers (Fleischhacker & Schmidt “Circulating nucleic acids (CNAs) and cancer—a survey,” Biochim Biophys Acta 1775:181-232, 2007, which is hereby incorporated by reference in its entirety).

BRAF is an oncogene downstream of Ras. BRAF mutations have been identified in glial neoplasm, melanoma, thyroid, and lung cancers (Dias-Santagata et al. BRAF V600E mutations are common in pleomorphic xanthoastrocytoma: diagnostic and therapeutic implications. PLOS ONE 2011; 6:e17948, 2011; Shinozaki et al. Utility of circulating B-RAF DNA mutation in serum for monitoring melanoma patients receiving biochemotherapy. Clin Canc Res 13:2068-2074, 2007; and Board et al. Detection of BRAF mutations in the tumor and serum of patients enrolled in the AZD6244 (ARRY-142886) advanced melanoma phase II study. Brit J Canc 2009; 101:1724-1730, which are each hereby incorporated by reference in its entirety). The BRAF V600E mutation occurs, e.g., in melanoma tumors, and is more common in advanced stages. The V600E mutation has been detected in cfDNA.

EGFR contributes to cell proliferation and is misregulated in many cancers (Downward J. Targeting RAS signaling pathways in cancer therapy. Nature Rev Cancer 3:11-22, 2003; and Levine & Oren “The first 30 years of p53: growing ever more complex,” Nature Rev Cancer 9:749-758, 2009, which are each hereby incorporated by reference in its entirety). Exemplary EGFR mutations include those in exons 18-21, which have been identified in lung cancer patients. EGFR cfDNA mutations have been identified in lung cancer patients (Jia et al. “Prediction of epidermal growth factor receptor mutations in the plasma/pleural effusion to efficacy of gefitinib treatment in advanced non-small cell lung cancer,” J Canc Res Clin Oncol 2010; 136:1341-1347, 2010, which is hereby incorporated by reference in its entirety).

Exemplary polymorphisms or mutations associated with breast cancer include LOH at microsatellites (Kohler et al. “Levels of plasma circulating cell free nuclear and mitochondrial DNA as potential biomarkers for breast tumors,” Mol Cancer 8:doi:10.1186/1476-4598-8-105, 2009, which is hereby incorporated by reference in its entirety), p53 mutations (such as mutations in exons 5-8) (Garcia et al. “Extracellular tumor DNA in plasma and overall survival in breast cancer patients,” Genes, Chromosomes & Cancer 45:692-701, 2006, which is hereby incorporated by reference in its entirety), HER2 (Sorensen et al. “Circulating HER2 DNA after trastuzumab treatment predicts survival and response in breast cancer,” Anticancer Res 30:2463-2468, 2010, which is hereby incorporated by reference in its entirety), PIK3CA, MED1, and GAS6 polymorphisms or mutations (Murtaza et al. “Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA,” Nature 2013; doi:10.1038/nature12065, 2013, which is hereby incorporated by reference in its entirety).

Increased cfDNA levels and LOH are associated with decreased overall and disease-free survival. p53 mutations (exons 5-8) are associated with decreased overall survival. Decreased circulating HER2 cfDNA levels are associated with a better response to HER2-targeted treatment in HER2-positive breast tumor subjects. An activating mutation in PIK3CA, a truncation of MED1, and a splicing mutation in GAS6 result in resistance to treatment.

Exemplary polymorphisms or mutations associated with colorectal cancer include p53, APC, K-ras, and thymidylate synthase mutations and p16 gene methylation (Wang et al. “Molecular detection of APC, K-ras, and p53 mutations in the serum of colorectal cancer patients as circulating biomarkers,” World J Surg 28:721-726, 2004; Ryan et al. “A prospective study of circulating mutant KRAS2 in the serum of patients with colorectal neoplasia: strong prognostic indicator in postoperative follow up,” Gut 52:101-108, 2003; Lecomte et al. “Detection of free-circulating tumor-associated DNA in plasma of colorectal cancer patients and its association with prognosis,” Int J Cancer 100:542-548, 2002; Schwarzenbach et al. “Molecular analysis of the polymorphisms of thymidylate synthase on cell-free circulating DNA in blood of patients with advanced colorectal carcinoma,” Int J Cancer 127:881-888, 2009, which are each hereby incorporated by reference in its entirety). Post-operative detection of K-ras mutations in serum is a strong predictor of disease recurrence. Detection of K-ras mutations and p16 gene methylation are associated with decreased survival and increased disease recurrence. Detection of K-ras, APC, and/or p53 mutations is associated with recurrence and/or metastases. Polymorphisms (including LOH, SNPs, variable number tandem repeats, and deletion) in the thymidylate synthase (the target of fluoropyrimidine-based chemotherapies) gene using cfDNA may be associated with treatment response.

Exemplary polymorphisms or mutations associated with lung cancer (such as non-small cell lung cancer) include K-ras (such as mutations in codon 12) and EGFR mutations. Exemplary prognostic mutations include EGFR mutations (exon 19 deletion or exon 21 mutation) associated with increased overall and progression-free survival and K-ras mutations (in codons 12 and 13) associated with decreased progression-free survival (Jian et al. “Prediction of epidermal growth factor receptor mutations in the plasma/pleural effusion to efficacy of gefitinib treatment in advanced non-small cell lung cancer,” J Canc Res Clin Oncol 136:1341-1347, 2010; Wang et al. “Potential clinical significance of a plasma-based KRAS mutation analysis in patients with advanced non-small cell lung cancer,” Clin Canc Res 16:1324-1330, 2010, which are each hereby incorporated by reference in its entirety). Exemplary polymorphisms or mutations indicative of response to treatment include EGFR mutations (exon 19 deletion or exon 21 mutation) that improve response to treatment and K-ras mutations (codons 12 and 13) that decrease the response to treatment. A resistance-conferring mutation in EGFR has been identified (Murtaza et al. “Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA,” Nature doi:10.1038/nature12065, 2013, which is hereby incorporated by reference in its entirety).

Exemplary polymorphisms or mutations associated with melanoma (such as uveal melanoma) include those in GNAQ, GNA11, BRAF, and p53. Exemplary GNAQ and GNA11 mutations include R183 and Q209 mutations. Q209 mutations in GNAQ or GNA11 are associated with metastases to bone. BRAF V600E mutations can be detected in patients with metastatic/advanced stage melanoma. BRAF V600E is an indicator of invasive melanoma. The presence of the BRAF V600E mutation after chemotherapy is associated with a non-response to the treatment.

Exemplary polymorphisms or mutations associated with pancreatic carcinomas include those in K-ras and p53 (such as p53 Ser249). p53 Ser249 is also associated with hepatitis B infection and hepatocellular carcinoma, as well as ovarian cancer, and non-Hodgkin's lymphoma.

Even polymorphisms or mutations that are present in low frequency in a sample can be detected with the methods of the invention. For example, a polymorphism or mutation that is present at a frequency of 1 in a million can be observed 10 times by performing 10 million sequencing reads. If desired, the number of sequencing reads can be altered depending on the level of sensitivity desired. In some embodiments, a sample is re-analyzed or another sample from a subject is analyzed using a greater number of sequencing reads to improve the sensitivity. For example, if no or only a small number (such as 1, 2, 3, 4, or 5) polymorphisms or mutations that are associated with cancer or an increased risk for cancer are detected, the sample is re-analyzed or another sample is tested.
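
The read-depth arithmetic above can be sketched as follows. The expected count is simply frequency times reads; the detection probability additionally assumes, as a simplification not made in the text, that reads sample molecules independently so that the number of variant reads is approximately Poisson distributed, and it ignores sequencing error. The function names are hypothetical.

```python
import math


def expected_observations(variant_frequency, num_reads):
    """Expected number of reads carrying the variant."""
    return variant_frequency * num_reads


def prob_at_least(min_observations, variant_frequency, num_reads):
    """Probability of seeing the variant in at least min_observations reads.

    Uses a Poisson approximation with mean = frequency * reads, which is an
    assumption for illustration (independent reads, no sequencing error).
    """
    mean = expected_observations(variant_frequency, num_reads)
    below = sum(
        math.exp(-mean) * mean ** i / math.factorial(i)
        for i in range(min_observations)
    )
    return 1.0 - below


# A variant at 1 in a million, sequenced with 10 million reads.
mean_count = expected_observations(1e-6, 10_000_000)
p_seen_once = prob_at_least(1, 1e-6, 10_000_000)
print(mean_count, p_seen_once)
```

Increasing `num_reads` raises both the expected count and the probability of reaching any chosen minimum count, which is the sense in which more reads improve sensitivity.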

In some embodiments, multiple polymorphisms or mutations are required for cancer or for metastatic cancer. In such cases, screening for multiple polymorphisms or mutations improves the ability to accurately diagnose cancer or metastatic cancer. In some embodiments when a subject has a subset of multiple polymorphisms or mutations that are required for cancer or for metastatic cancer, the subject can be re-screened later to see if the subject acquires additional mutations.

In some embodiments in which multiple polymorphisms or mutations are required for cancer or for metastatic cancer, the frequency of each polymorphism or mutation can be compared to see if they occur at similar frequencies. For example, if two mutations are required for cancer (denoted “A” and “B”), some cells will have none, some cells will have A, some will have B, and some will have A and B. If A and B are observed at similar frequencies, the subject is more likely to have some cells with both A and B. If A and B are observed at dissimilar frequencies, the subject is more likely to have different cell populations.

In some embodiments in which multiple polymorphisms or mutations are required for cancer or for metastatic cancer, the number or identity of such polymorphisms or mutations that are present in the subject can be used to predict how likely or soon the subject is likely to have the disease or disorder. In some embodiments in which polymorphisms or mutations tend to occur in a certain order, the subject may be periodically tested to see if the subject has acquired the other polymorphisms or mutations.

In some embodiments, determining the presence or absence of multiple polymorphisms or mutations (such as 2, 3, 4, 5, 8, 10, 12, 15, or more) increases the sensitivity and/or specificity of the determination of the presence or absence of a disease or disorder such as cancer, or an increased risk for a disease or disorder such as cancer.

In some embodiments, the polymorphism(s) or mutation(s) are directly detected. In some embodiments, the polymorphism(s) or mutation(s) are indirectly detected by detection of one or more sequences (e.g., a polymorphic locus such as a SNP) that are linked to the polymorphism or mutation.

Exemplary Nucleic Acid Alterations

In some embodiments, there is a change to the integrity of RNA or DNA (such as a change in the size of fragmented cfRNA or cfDNA or a change in nucleosome composition) that is associated with a disease or disorder such as cancer, or an increased risk for a disease or disorder such as cancer. In some embodiments, there is a change in the methylation pattern of RNA or DNA that is associated with a disease or disorder such as cancer, or an increased risk for a disease or disorder such as cancer (e.g., hypermethylation of tumor suppressor genes). For example, methylation of the CpG islands in the promoter region of tumor-suppressor genes has been suggested to trigger local gene silencing. Aberrant methylation of the p16 tumor suppressor gene occurs in subjects with liver, lung, and breast cancer. Other frequently methylated tumor suppressor genes, including APC, Ras association domain family protein 1A (RASSF1A), glutathione S-transferase P1 (GSTP1), and DAPK, have been detected in various types of cancers, for example nasopharyngeal carcinoma, colorectal cancer, lung cancer, esophageal cancer, prostate cancer, bladder cancer, melanoma, and acute leukemia. Methylation of certain tumor-suppressor genes, such as p16, has been described as an early event in cancer formation, and thus is useful for early cancer screening.

In some embodiments, bisulphite conversion or a non-bisulphite based strategy using methylation sensitive restriction enzyme digestion is used to determine the methylation pattern (Hung et al., J Clin Pathol 62:308-313, 2009, which is hereby incorporated by reference in its entirety). On bisulphite conversion, methylated cytosines remain as cytosines while unmethylated cytosines are converted to uracils. Methylation-sensitive restriction enzymes (e.g., BstUI) cleave unmethylated DNA sequences at specific recognition sites (e.g., 5′-CG∨CG-3′ for BstUI), while methylated sequences remain intact. In some embodiments, the intact methylated sequences are detected. In some embodiments, stem-loop primers are used to selectively amplify restriction enzyme-digested unmethylated fragments without co-amplifying the non-enzyme-digested methylated DNA.
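
The two readouts described above can be mimicked on a toy sequence. This is an idealized sketch, not a laboratory protocol: it assumes complete conversion, a single strand, and that a BstUI site is protected whenever any cytosine within it is methylated. The function names and the representation of methylation as a set of 0-based positions are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Idealized bisulphite conversion of a single DNA strand.

    Unmethylated cytosines become uracil (U); cytosines whose 0-based
    position is in methylated_positions remain cytosine.
    """
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in methylated_positions:
            converted.append("U")
        else:
            converted.append(base)
    return "".join(converted)


def bstui_cut_sites(sequence, methylated_positions):
    """0-based starts of CGCG sites that would be cleaved in this toy model.

    A site is treated as cleavable only if neither of its cytosines
    (offsets 0 and 2 within CGCG) is methylated. This is a simplification.
    """
    seq = sequence.upper()
    sites = []
    for start in range(len(seq) - 3):
        if seq[start:start + 4] != "CGCG":
            continue
        if start in methylated_positions or start + 2 in methylated_positions:
            continue  # methylated site stays intact
        sites.append(start)
    return sites


example = "AACGCGTTCGCG"
methylated = {2}  # the first cytosine of the first CGCG site
print(bisulfite_convert(example, methylated))
print(bstui_cut_sites(example, methylated))
```

In this model the methylated site at position 2 is left intact while the unmethylated site at position 8 is cut, matching the statement that methylated sequences remain intact.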

Exemplary Changes in mRNA Splicing

In some embodiments, a change in mRNA splicing is associated with a disease or disorder such as cancer, or an increased risk for a disease or disorder such as cancer. In some embodiments, the change in mRNA splicing is in one or more of the following nucleic acids associated with cancer or an increased risk for cancer: DNMT3B, BRCA1, KLF6, Ron, or Gemin5. In some embodiments, the detected mRNA splice variant is associated with a disease or disorder, such as cancer. In some embodiments, multiple mRNA splice variants are produced by healthy cells (such as non-cancerous cells), but a change in the relative amounts of the mRNA splice variants is associated with a disease or disorder, such as cancer. In some embodiments, the change in mRNA splicing is due to a change in the mRNA sequence (such as a mutation in a splice site), a change in splicing factor levels, a change in the amount of available splicing factor (such as a decrease in the amount of available splicing factor due to the binding of a splicing factor to a repeat), altered splicing regulation, or the tumor microenvironment.

The splicing reaction is carried out by a multi-protein/RNA complex called the spliceosome (Fackenthal and Godley, Disease Models & Mechanisms 1: 37-42, 2008, doi:10.1242/dmm.000331, which is hereby incorporated by reference in its entirety). The spliceosome recognizes intron-exon boundaries and removes intervening introns via two transesterification reactions that result in ligation of two adjacent exons. The fidelity of this reaction must be exquisite, because if the ligation occurs incorrectly, normal protein-encoding potential may be compromised. For example, in cases where exon-skipping preserves the reading frame of the triplet codons specifying the identity and order of amino acids during translation, the alternatively spliced mRNA may specify a protein that lacks crucial amino acid residues. More commonly, exon-skipping will disrupt the translational reading frame, resulting in premature stop codons. These mRNAs are typically degraded by at least 90% through a process known as nonsense-mediated mRNA degradation, which reduces the likelihood that such defective messages will accumulate to generate truncated protein products. If mis-spliced mRNAs escape this pathway, then truncated, mutated, or unstable proteins are produced.

Alternative splicing is a means of expressing several or many different transcripts from the same genomic DNA and results from the inclusion of a subset of the available exons for a particular protein. By excluding one or more exons, certain protein domains may be lost from the encoded protein, which can result in protein function loss or gain. Several types of alternative splicing have been described: exon skipping; alternative 5′ or 3′ splice sites; mutually exclusive exons; and, much more rarely, intron retention. Others have compared the amount of alternative splicing in cancer versus normal cells using a bioinformatic approach and determined that cancers exhibit lower levels of alternative splicing than normal cells. Furthermore, the distribution of the types of alternative splicing events differed in cancer versus normal cells. Cancer cells demonstrated less exon skipping, but more alternative 5′ and 3′ splice site selection and intron retention than normal cells. When the phenomenon of exonization (the use of sequences as exons that are used predominantly by other tissues as introns) was examined, genes associated with exonization in cancer cells were preferentially associated with mRNA processing, indicating a direct link between cancer cells and the generation of aberrant mRNA splice forms.

Exemplary Changes in DNA or RNA Levels

In some embodiments, there is a change in the total amount or concentration of one or more types of DNA (such as cfDNA, cf mDNA, cf nDNA, cellular DNA, or mitochondrial DNA) or RNA (cfRNA, cellular RNA, cytoplasmic RNA, coding cytoplasmic RNA, non-coding cytoplasmic RNA, mRNA, miRNA, mitochondrial RNA, rRNA, or tRNA). In some embodiments, there is a change in the amount or concentration of one or more specific DNA (such as cfDNA, cf mDNA, cf nDNA, cellular DNA, or mitochondrial DNA) or RNA (cfRNA, cellular RNA, cytoplasmic RNA, coding cytoplasmic RNA, non-coding cytoplasmic RNA, mRNA, miRNA, mitochondrial RNA, rRNA, or tRNA) molecules. In some embodiments, one allele is expressed more than another allele of a locus of interest. Exemplary miRNAs are short 20-22 nucleotide RNA molecules that regulate the expression of a gene. In some embodiments, there is a change in the transcriptome, such as a change in the identity or amount of one or more RNA molecules.

In some embodiments, an increase in the total amount or concentration of cfDNA or cfRNA is associated with a disease or disorder such as cancer, or an increased risk for a disease or disorder such as cancer. In some embodiments, the total concentration of a type of DNA (such as cfDNA, cf mDNA, cf nDNA, cellular DNA, or mitochondrial DNA) or RNA (cfRNA, cellular RNA, cytoplasmic RNA, coding cytoplasmic RNA, non-coding cytoplasmic RNA, mRNA, miRNA, mitochondrial RNA, rRNA, or tRNA) increases by at least 2, 3, 4, 5, 6, 7, 8, 9, 10-fold, or more compared to the total concentration of that type of DNA or RNA in healthy (such as non-cancerous) subjects. In some embodiments, a total concentration of cfDNA between 75 to 100 ng/mL, 100 to 150 ng/mL, 150 to 200 ng/mL, 200 to 300 ng/mL, 300 to 400 ng/mL, 400 to 600 ng/mL, 600 to 800 ng/mL, 800 to 1,000 ng/mL, inclusive, or a total concentration of cfDNA of more than 100 ng/mL, such as more than 200, 300, 400, 500, 600, 700, 800, 900, or 1,000 ng/mL is indicative of cancer, an increased risk for cancer, an increased risk of a tumor being malignant rather than benign, a decreased probability of the cancer going into remission, or a worse prognosis for the cancer. In some embodiments, the amount of a type of DNA (such as cfDNA, cf mDNA, cf nDNA, cellular DNA, or mitochondrial DNA) or RNA (cfRNA, cellular RNA, cytoplasmic RNA, coding cytoplasmic RNA, non-coding cytoplasmic RNA, mRNA, miRNA, mitochondrial RNA, rRNA, or tRNA) having one or more polymorphisms/mutations (such as deletions or duplications) associated with a disease or disorder such as cancer or an increased risk for a disease or disorder such as cancer is at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 20, or 25% of the total amount of that type of DNA or RNA. In some embodiments, at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 20, or 25% of the total amount of a type of DNA (such as cfDNA, cf mDNA, cf nDNA, cellular DNA, or mitochondrial DNA) or RNA (cfRNA, cellular RNA, cytoplasmic RNA, coding cytoplasmic RNA, non-coding cytoplasmic RNA, mRNA, miRNA, mitochondrial RNA, rRNA, or tRNA) has a particular polymorphism or mutation (such as a deletion or duplication) associated with a disease or disorder such as cancer or an increased risk for a disease or disorder such as cancer.
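
The fold-increase and percentage comparisons described above are simple ratios. The following is a minimal illustrative sketch in Python, not part of any claimed method; the function names are hypothetical, and the numeric inputs in the example are placeholders chosen only to make the arithmetic visible, not measured or reference values.

```python
def fold_increase(sample_conc_ng_per_ml, healthy_conc_ng_per_ml):
    """Ratio of a subject's total concentration to a healthy baseline.

    Both inputs are assumed to be in the same units (ng/mL here).
    """
    if healthy_conc_ng_per_ml <= 0:
        raise ValueError("healthy baseline concentration must be positive")
    return sample_conc_ng_per_ml / healthy_conc_ng_per_ml


def mutant_percentage(mutant_amount, total_amount):
    """Percentage of a DNA or RNA type that carries a given mutation."""
    if total_amount <= 0:
        raise ValueError("total amount must be positive")
    return 100.0 * mutant_amount / total_amount


# Placeholder inputs (hypothetical, for illustration only).
example_fold = fold_increase(150.0, 30.0)           # 150 / 30
example_percent = mutant_percentage(12.0, 150.0)    # 100 * 12 / 150

# Compare against the lowest thresholds recited in the text above.
meets_fold_threshold = example_fold >= 2.0
meets_percent_threshold = example_percent >= 2.0
```

Each threshold is checked separately here because the text recites them as separate embodiments; combining them into one decision rule would be an additional assumption.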

In some embodiments, the cfDNA is encapsulated. In some embodiments, thecfDNA is not encapsulated.

In some embodiments, the fraction of tumor DNA out of total DNA (such as fraction of tumor cfDNA out of total cfDNA or fraction of tumor cfDNA with a particular mutation out of total cfDNA) is determined. In some embodiments, the fraction of tumor DNA may be determined for a plurality of mutations, where the mutations can be single nucleotide variants, copy number variants, differential methylation, or combinations thereof. In some embodiments, the average tumor fraction calculated for one or a set of mutations with the highest calculated tumor fraction is taken as the actual tumor fraction in the sample. In some embodiments, the average tumor fraction calculated for all of the mutations is taken as the actual tumor fraction in the sample. In some embodiments, this tumor fraction is used to stage a cancer (since higher tumor fractions can be associated with more advanced stages of cancer). In some embodiments, the tumor fraction is used to size a cancer, since larger tumors may be correlated with the fraction of tumor DNA in the plasma. In some embodiments, the tumor fraction is used to size the proportion of a tumor that is afflicted with a single or plurality of mutations, since there may be a correlation between the measured tumor fraction in a plasma sample and the size of tissue with a given mutation(s) genotype. For example, the size of tissue with a given mutation(s) genotype may be correlated with the fraction of tumor DNA that may be calculated by focusing on that particular mutation(s).
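
The two averaging options described above (averaging over all mutations, or over the one or more mutations with the highest calculated tumor fraction) can be sketched as follows. This is an illustrative Python sketch only; the function name, the `top_n` parameter, and the mutation labels are hypothetical, and the per-mutation fractions are placeholder numbers rather than data.

```python
from statistics import mean


def estimate_tumor_fraction(per_mutation_fractions, top_n=None):
    """Combine per-mutation tumor fractions into one sample-level estimate.

    per_mutation_fractions: mapping of mutation label -> tumor fraction (0..1).
    top_n: if None, average over all mutations; otherwise average over the
           top_n mutations with the highest calculated tumor fraction.
    """
    values = list(per_mutation_fractions.values())
    if not values:
        raise ValueError("at least one per-mutation tumor fraction is required")
    if top_n is None:
        return mean(values)
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    highest = sorted(values, reverse=True)[:top_n]
    return mean(highest)


# Placeholder per-mutation estimates (hypothetical labels and values).
fractions = {"SNV_A": 0.04, "SNV_B": 0.10, "CNV_C": 0.07}

average_of_all = estimate_tumor_fraction(fractions)
average_of_top_two = estimate_tumor_fraction(fractions, top_n=2)
highest_only = estimate_tumor_fraction(fractions, top_n=1)
```

How the per-mutation fractions themselves are obtained is outside this sketch; it only shows the final combination step.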

Exemplary Databases

The invention also features databases containing one or more results from a method of the invention. For example, the database may include records with any of the following information for one or more subjects: any polymorphisms/mutations (such as CNVs) identified, any known association of the polymorphisms/mutations with a disease or disorder or an increased risk for a disease or disorder, effect of the polymorphisms/mutations on the expression or activity level of the encoded mRNA or protein, fraction of DNA, RNA, or cells associated with a disease or disorder (such as DNA, RNA, or cells having a polymorphism/mutation associated with a disease or disorder) out of the total DNA, RNA, or cells in the sample, source of the sample used to identify the polymorphisms/mutations (such as a blood sample or sample from a particular tissue), number of diseased cells, results from later repeating the test (such as repeating the test to monitor the progression or remission of the disease or disorder), results of other tests for the disease or disorder, type of disease or disorder the subject was diagnosed with, treatment(s) administered, response to such treatment(s), side-effects of such treatment(s), symptoms (such as symptoms associated with the disease or disorder), length and number of remissions, length of survival (such as length of time from initial test until death or length of time from diagnosis until death), cause of death, and combinations thereof.

In some embodiments, the database includes records with any of the following information for one or more subjects: any polymorphisms/mutations identified, any known association of the polymorphisms/mutations with cancer or an increased risk for cancer, effect of the polymorphisms/mutations on the expression or activity level of the encoded mRNA or protein, fraction of cancerous DNA, RNA or cells out of the total DNA, RNA, or cells in sample, source of sample used to identify the polymorphisms/mutations (such as a blood sample or sample from a particular tissue), number of cancerous cells, size of tumor(s), results from later repeating the test (such as repeating the test to monitor the progression or remission of the cancer), results of other tests for cancer, type of cancer the subject was diagnosed with, treatment(s) administered, response to such treatment(s), side-effects of such treatment(s), symptoms (such as symptoms associated with cancer), length and number of remissions, length of survival (such as length of time from initial test until death or length of time from cancer diagnosis until death), cause of death, and combinations thereof. In some embodiments, the response to treatment includes any of the following: reducing or stabilizing the size of a tumor (e.g., a benign or cancerous tumor), slowing or preventing an increase in the size of a tumor, reducing or stabilizing the number of tumor cells, increasing the disease-free survival time between the disappearance of a tumor and its reappearance, preventing an initial or subsequent occurrence of a tumor, reducing or stabilizing an adverse symptom associated with a tumor, or combinations thereof. In some embodiments, the results from one or more other tests for a disease or disorder such as cancer are included, such as results from screening tests, medical imaging, or microscopic examination of a tissue sample.

In one such aspect, the invention features an electronic database including at least 5, 10, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸ or more records. In some embodiments, the database has records for at least 5, 10, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸ or more different subjects.

In another aspect, the invention features a computer including a database of the invention and a user interface. In some embodiments, the user interface is capable of displaying a portion or all of the information contained in one or more records. In some embodiments, the user interface is capable of displaying (i) one or more types of cancer that have been identified as containing a polymorphism or mutation whose record is stored in the computer, (ii) one or more polymorphisms or mutations that have been identified in a particular type of cancer whose record is stored in the computer, (iii) prognosis information for a particular type of cancer or a particular polymorphism or mutation whose record is stored in the computer, (iv) one or more compounds or other treatments useful for cancer with a polymorphism or mutation whose record is stored in the computer, (v) one or more compounds that modulate the expression or activity of an mRNA or protein whose record is stored in the computer, and (vi) one or more mRNA molecules or proteins whose expression or activity is modulated by a compound whose record is stored in the computer. The internal components of the computer typically include a processor coupled to a memory. The external components usually include a mass-storage device, e.g., a hard disk drive; user input devices, e.g., a keyboard and a mouse; a display, e.g., a monitor; and optionally, a network link capable of connecting the computer system to other computers to allow sharing of data and processing tasks. Programs may be loaded into the memory of this system during operation.
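
As a non-limiting illustration of such records and of lookups (i) and (ii) above, the following Python sketch stores a few fields in an in-memory SQLite database using the standard `sqlite3` module. The table name, column names, helper functions, and the example rows are all hypothetical placeholders; a real database of the invention could hold any of the fields listed above in any schema.

```python
import sqlite3

# Hypothetical minimal schema: one row per mutation observed in a subject.
SCHEMA = """
CREATE TABLE subject_record (
    subject_id     TEXT NOT NULL,
    mutation       TEXT NOT NULL,
    cancer_type    TEXT,
    tumor_fraction REAL,
    sample_source  TEXT
)
"""


def build_database(rows):
    """Create an in-memory database and load (subject, mutation, ...) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO subject_record VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def count_subjects(conn):
    """Number of different subjects with at least one record."""
    query = "SELECT COUNT(DISTINCT subject_id) FROM subject_record"
    return conn.execute(query).fetchone()[0]


def cancer_types_with_mutation(conn, mutation):
    """Lookup (i): cancer types recorded as containing a given mutation."""
    query = ("SELECT DISTINCT cancer_type FROM subject_record "
             "WHERE mutation = ? ORDER BY cancer_type")
    return [row[0] for row in conn.execute(query, (mutation,))]


def mutations_in_cancer_type(conn, cancer_type):
    """Lookup (ii): mutations recorded for a given cancer type."""
    query = ("SELECT DISTINCT mutation FROM subject_record "
             "WHERE cancer_type = ? ORDER BY mutation")
    return [row[0] for row in conn.execute(query, (cancer_type,))]


# Placeholder rows (made-up identifiers, not real subjects or results).
example_rows = [
    ("S1", "MUT_X", "cancer_type_1", 0.05, "blood"),
    ("S2", "MUT_X", "cancer_type_2", 0.12, "blood"),
    ("S2", "MUT_Y", "cancer_type_2", 0.09, "blood"),
]
example_db = build_database(example_rows)
```

A user interface would sit on top of queries like these; the sketch deliberately omits display, access control, and persistence.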

In another aspect, the invention features a computer-implemented process that includes one or more steps of any of the methods of the invention.

Exemplary Risk Factors

In some embodiments, the subject is also evaluated for one or more risk factors for a disease or disorder, such as cancer. Exemplary risk factors include family history for the disease or disorder, lifestyle (such as smoking and exposure to carcinogens) and the level of one or more hormones or serum proteins (such as alpha-fetoprotein (AFP) in liver cancer, carcinoembryonic antigen (CEA) in colorectal cancer, or prostate-specific antigen (PSA) in prostate cancer). In some embodiments, the size and/or number of tumors is measured and used in determining a subject's prognosis or selecting a treatment for the subject.

Exemplary Screening Methods

If desired, the presence or absence of a disease or disorder such as cancer can be confirmed, or the disease or disorder such as cancer can be classified using any standard method. For example, a disease or disorder such as cancer can be detected in a number of ways, including the presence of certain signs and symptoms, tumor biopsy, screening tests, or medical imaging (such as a mammogram or an ultrasound). Once a possible cancer is detected, it may be diagnosed by microscopic examination of a tissue sample. In some embodiments, a diagnosed subject undergoes repeat testing using a method of the invention or known testing for the disease or disorder at multiple time points to monitor the progression of the disease or disorder or the remission or reoccurrence of the disease or disorder.

Exemplary Cancers

Exemplary cancers that can be diagnosed, prognosed, stabilized, treated, or prevented using any of the methods of the invention include solid tumors, carcinomas, sarcomas, lymphomas, leukemias, germ cell tumors, or blastomas. In various embodiments, the cancer is an acute lymphoblastic leukemia, acute myeloid leukemia, adrenocortical carcinoma, AIDS-related cancer, AIDS-related lymphoma, anal cancer, appendix cancer, astrocytoma (such as childhood cerebellar or cerebral astrocytoma), basal-cell carcinoma, bile duct cancer (such as extrahepatic bile duct cancer), bladder cancer, bone tumor (such as osteosarcoma or malignant fibrous histiocytoma), brainstem glioma, brain cancer (such as cerebellar astrocytoma, cerebral astrocytoma/malignant glioma, ependymoma, medulloblastoma, supratentorial primitive neuroectodermal tumors, or visual pathway and hypothalamic glioma), glioblastoma, breast cancer, bronchial adenoma or carcinoid, Burkitt's lymphoma, carcinoid tumor (such as a childhood or gastrointestinal carcinoid tumor), carcinoma, central nervous system lymphoma, cerebellar astrocytoma or malignant glioma (such as childhood cerebellar astrocytoma or malignant glioma), cervical cancer, childhood cancer, chronic lymphocytic leukemia, chronic myelogenous leukemia, chronic myeloproliferative disorders, colon cancer, cutaneous T-cell lymphoma, desmoplastic small round cell tumor, endometrial cancer, ependymoma, esophageal cancer, Ewing's sarcoma, tumor in the Ewing family of tumors, extracranial germ cell tumor (such as a childhood extracranial germ cell tumor), extragonadal germ cell tumor, eye cancer (such as intraocular melanoma or retinoblastoma eye cancer), gallbladder cancer, gastric cancer, gastrointestinal carcinoid tumor, gastrointestinal stromal tumor, germ cell tumor (such as extracranial, extragonadal, or ovarian germ cell tumor), gestational trophoblastic tumor, glioma (such as brain stem, childhood cerebral astrocytoma, or childhood visual pathway and hypothalamic glioma), gastric carcinoid, hairy cell leukemia, head and neck cancer, heart cancer, hepatocellular (liver) cancer, Hodgkin lymphoma, hypopharyngeal cancer, hypothalamic and visual pathway glioma (such as childhood visual pathway glioma), islet cell carcinoma (such as endocrine or pancreas islet cell carcinoma), Kaposi sarcoma, kidney cancer, laryngeal cancer, leukemia (such as acute lymphoblastic, acute myeloid, chronic lymphocytic, chronic myelogenous, or hairy cell leukemia), lip or oral cavity cancer, liposarcoma, liver cancer, lung cancer (such as non-small cell or small cell cancer), lymphoma (such as AIDS-related, Burkitt, cutaneous T-cell, Hodgkin, non-Hodgkin, or central nervous system lymphoma), macroglobulinemia (such as Waldenström macroglobulinemia), malignant fibrous histiocytoma of bone or osteosarcoma, medulloblastoma (such as childhood medulloblastoma), melanoma, Merkel cell carcinoma, mesothelioma (such as adult or childhood mesothelioma), metastatic squamous neck cancer with occult, mouth cancer, multiple endocrine neoplasia syndrome (such as childhood multiple endocrine neoplasia syndrome), multiple myeloma or plasma cell neoplasm, mycosis fungoides, myelodysplastic syndrome, myelodysplastic or myeloproliferative disease, myelogenous leukemia (such as chronic myelogenous leukemia), myeloid leukemia (such as adult acute or childhood acute myeloid leukemia), myeloproliferative disorder (such as chronic myeloproliferative disorder), nasal cavity or paranasal sinus cancer, nasopharyngeal carcinoma, neuroblastoma, oral cancer, oropharyngeal cancer, osteosarcoma or malignant fibrous histiocytoma of bone, ovarian cancer, ovarian epithelial cancer, ovarian germ cell tumor, ovarian low malignant potential tumor, pancreatic cancer (such as islet cell pancreatic cancer), paranasal sinus or nasal cavity cancer, parathyroid cancer, penile cancer, pharyngeal cancer, pheochromocytoma, pineal astrocytoma, pineal germinoma, pineoblastoma or supratentorial primitive neuroectodermal tumor (such as childhood pineoblastoma or supratentorial primitive neuroectodermal tumor), pituitary adenoma, plasma cell neoplasia, pleuropulmonary blastoma, primary central nervous system lymphoma, cancer, rectal cancer, renal cell carcinoma, renal pelvis or ureter cancer (such as renal pelvis or ureter transitional cell cancer), retinoblastoma, rhabdomyosarcoma (such as childhood rhabdomyosarcoma), salivary gland cancer, sarcoma (such as sarcoma in the Ewing family of tumors, Kaposi, soft tissue, or uterine sarcoma), Sézary syndrome, skin cancer (such as nonmelanoma, melanoma, or Merkel cell skin cancer), small intestine cancer, squamous cell carcinoma, supratentorial primitive neuroectodermal tumor (such as childhood supratentorial primitive neuroectodermal tumor), T-cell lymphoma (such as cutaneous T-cell lymphoma), testicular cancer, throat cancer, thymoma (such as childhood thymoma), thymoma or thymic carcinoma, thyroid cancer (such as childhood thyroid cancer), trophoblastic tumor (such as gestational trophoblastic tumor), unknown primary site carcinoma (such as adult or childhood unknown primary site carcinoma), urethral cancer, uterine cancer (such as endometrial uterine cancer), uterine sarcoma, vaginal cancer, visual pathway or hypothalamic glioma (such as childhood visual pathway or hypothalamic glioma), vulvar cancer, Waldenström macroglobulinemia, or Wilms tumor (such as childhood Wilms tumor). In various embodiments, the cancer has metastasized or has not metastasized.

The cancer may or may not be a hormone related or dependent cancer (e.g., an estrogen or androgen related cancer). Benign tumors or malignant tumors may be diagnosed, prognosed, stabilized, treated, or prevented using the methods and/or compositions of the present invention.

In some embodiments, the subject has a cancer syndrome. A cancer syndrome is a genetic disorder in which genetic mutations in one or more genes predispose the affected individuals to the development of cancers and may also cause the early onset of these cancers. Cancer syndromes often show not only a high lifetime risk of developing cancer, but also the development of multiple independent primary tumors. Many of these syndromes are caused by mutations in tumor suppressor genes, genes that are involved in protecting the cell from turning cancerous. Other genes that may be affected are DNA repair genes, oncogenes and genes involved in the production of blood vessels (angiogenesis). Common examples of inherited cancer syndromes are hereditary breast-ovarian cancer syndrome and hereditary non-polyposis colon cancer (Lynch syndrome).

In some embodiments, a subject with one or more polymorphisms or mutations in K-ras, p53, BRA, EGFR, or HER2 is administered a treatment that targets K-ras, p53, BRA, EGFR, or HER2, respectively.

The methods of the invention can be generally applied to the treatment of malignant or benign tumors of any cell, tissue, or organ type.

Exemplary Treatments

If desired, any treatment for stabilizing, treating, or preventing a disease or disorder such as cancer or an increased risk for a disease or disorder such as cancer can be administered to a subject (e.g., a subject identified as having cancer or an increased risk for cancer using any of the methods of the invention). In various embodiments, the treatment is a known treatment or combination of treatments for a disease or disorder such as cancer, such as cytotoxic agents, targeted therapy, immunotherapy, hormonal therapy, radiation therapy, surgical removal of cancerous cells or cells likely to become cancerous, stem cell transplantation, bone marrow transplantation, photodynamic therapy, palliative treatment, or a combination thereof. In some embodiments, a treatment (such as a preventative medication) is used to prevent, delay, or reduce the severity of a disease or disorder such as cancer in a subject at increased risk for a disease or disorder such as cancer.

In some embodiments, the targeted therapy is a treatment that targets the cancer's specific genes, proteins, or the tissue environment that contributes to cancer growth and survival. This type of treatment blocks the growth and spread of cancer cells while limiting damage to normal cells, usually leading to fewer side effects than other cancer medications.

One of the more successful approaches has been to target angiogenesis, the new blood vessel growth around a tumor. Targeted therapies such as bevacizumab (Avastin), lenalidomide (Revlimid), sorafenib (Nexavar), sunitinib (Sutent), and thalidomide (Thalomid) interfere with angiogenesis. Another example is the use of a treatment that targets HER2, such as trastuzumab or lapatinib, for cancers that overexpress HER2 (such as some breast cancers). In some embodiments, a monoclonal antibody is used to block a specific target on the outside of cancer cells. Examples include alemtuzumab (Campath-1H), bevacizumab, cetuximab (Erbitux), panitumumab (Vectibix), pertuzumab (Omnitarg), rituximab (Rituxan), and trastuzumab. In some embodiments, the monoclonal antibody tositumomab (Bexxar) is used to deliver radiation to the tumor. In some embodiments, an oral small molecule inhibits a cancer process inside of a cancer cell. Examples include dasatinib (Sprycel), erlotinib (Tarceva), gefitinib (Iressa), imatinib (Gleevec), lapatinib (Tykerb), nilotinib (Tasigna), sorafenib, sunitinib, and temsirolimus (Torisel). In some embodiments, a proteasome inhibitor (such as the multiple myeloma drug, bortezomib (Velcade)) interferes with specialized proteins called enzymes that break down other proteins in the cell.

In some embodiments, immunotherapy is designed to boost the body's natural defenses to fight the cancer. Exemplary types of immunotherapy use materials made either by the body or in a laboratory to bolster, target, or restore immune system function.

In some embodiments, hormonal therapy treats cancer by lowering the amounts of hormones in the body. Several types of cancer, including some breast and prostate cancers, only grow and spread in the presence of natural chemicals in the body called hormones. In various embodiments, hormonal therapy is used to treat cancers of the prostate, breast, thyroid, and reproductive system.

In some embodiments, the treatment includes a stem cell transplant in which diseased bone marrow is replaced by highly specialized cells, called hematopoietic stem cells. Hematopoietic stem cells are found both in the bloodstream and in the bone marrow.

In some embodiments, the treatment includes photodynamic therapy, which uses special drugs, called photosensitizing agents, along with light to kill cancer cells. The drugs work after they have been activated by certain kinds of light.

In some embodiments, the treatment includes surgical removal of cancerous cells or cells likely to become cancerous (such as a lumpectomy or a mastectomy). For example, a woman with a breast cancer susceptibility gene mutation (BRCA1 or BRCA2 gene mutation) may reduce her risk of breast and ovarian cancer with a risk reducing salpingo-oophorectomy (removal of the fallopian tubes and ovaries) and/or a risk reducing bilateral mastectomy (removal of both breasts). Lasers, which are very powerful, precise beams of light, can be used instead of blades (scalpels) for very careful surgical work, including treating some cancers.

In addition to treatment to slow, stop, or eliminate the cancer (also called disease-directed treatment), an important part of cancer care is relieving a subject's symptoms and side effects, such as pain and nausea. This care includes supporting the subject with physical, emotional, and social needs, an approach called palliative or supportive care. People often receive disease-directed therapy and treatment to ease symptoms at the same time.

Exemplary treatments include actinomycin D, adcetris, Adriamycin, aldesleukin, alemtuzumab, alimta, amsidine, amsacrine, anastrozole, aredia, arimidex, aromasin, asparaginase, avastin, bevacizumab, bicalutamide, bleomycin, bondronat, bonefos, bortezomib, busilvex, busulphan, campto, capecitabine, carboplatin, carmustine, casodex, cetuximab, chimax, chlorambucil, cimetidine, cisplatin, cladribine, clodronate, clofarabine, crisantaspase, cyclophosphamide, cyproterone acetate, cyprostat, cytarabine, cytoxan, dacarbazine, dactinomycin, dasatinib, daunorubicin, dexamethasone, diethylstilbestrol, docetaxel, doxorubicin, drogenil, emcyt, epirubicin, eposin, Erbitux, erlotinib, estracyte, estramustine, etopophos, etoposide, evoltra, exemestane, fareston, femara, filgrastim, fludara, fludarabine, fluorouracil, flutamide, gefitinib, gemcitabine, gemzar, gleevec, glivec, gonapeptyl depot, goserelin, halaven, herceptin, hycamtin, hydroxycarbamide, ibandronic acid, ibritumomab, idarubicin, ifosfamide, interferon, imatinib mesylate, iressa, irinotecan, jevtana, lanvis, lapatinib, letrozole, leukeran, leuprorelin, leustat, lomustine, mabcampath, mabthera, megace, megestrol, methotrexate, mitozantrone, mitomycin, matulane, myleran, navelbine, neulasta, neupogen, nexavar, nipent, nolvadex D, novantron, oncovin, paclitaxel, pamidronate, PCV, pemetrexed, pentostatin, perjeta, procarbazine, provenge, prednisolone, prostrap, raltitrexed, rituximab, sprycel, sorafenib, soltamox, streptozocin, stilboestrol, stimuvax, sunitinib, sutent, tabloid, tagamet, tamofen, tamoxifen, tarceva, taxol, taxotere, tegafur with uracil, temodal, temozolomide, thalidomide, thioplex, thiotepa, tioguanine, tomudex, topotecan, toremifene, trastuzumab, tretinoin, treosulfan, triethylenethiophosphoramide, triptorelin, tyverb, uftoral, velcade, vepesid, vesanoid, vincristine, vinorelbine, xalkori, xeloda, yervoy, zactima, zanosar, zavedos, zevalin, zoladex, zoledronate, zometa, zoledronic acid, and zytiga.

For subjects that express both a mutant form (e.g., a cancer-related form) and a wild-type form (e.g., a form not associated with cancer) of an mRNA or protein, the therapy preferably inhibits the expression or activity of the mutant form by at least 2, 5, 10, or 20-fold more than it inhibits the expression or activity of the wild-type form. The simultaneous or sequential use of multiple therapeutic agents may greatly reduce the incidence of cancer and reduce the number of treated cancers that become resistant to therapy. In addition, therapeutic agents that are used as part of a combination therapy may require a lower dose to treat cancer than the corresponding dose required when the therapeutic agents are used individually. The low dose of each compound in the combination therapy reduces the severity of potential adverse side-effects from the compounds.

In some embodiments, a subject identified as having an increased risk of cancer may invention or any standard method), avoid specific risk factors, or make lifestyle changes to reduce any additional risk of cancer.

In some embodiments, the polymorphisms, mutations, risk factors, or any combination thereof are used to select a treatment regimen for the subject. In some embodiments, a larger dose or greater number of treatments is selected for a subject at greater risk of cancer or with a worse prognosis.

Other Compounds for Inclusion in Individual or Combination Therapies

If desired, additional compounds for stabilizing, treating, or preventing a disease or disorder such as cancer or an increased risk for a disease or disorder such as cancer may be identified from large libraries of natural product or synthetic (or semi-synthetic) extracts or from chemical libraries according to methods known in the art. Those skilled in the field of drug discovery and development will understand that the precise source of test extracts or compounds is not critical to the methods of the invention. Accordingly, virtually any number of chemical extracts or compounds can be screened for their effect on cells from a particular type of cancer or from a particular subject or screened for their effect on the activity or expression of cancer related molecules (such as cancer related molecules known to have altered activity or expression in a particular type of cancer). When a crude extract is found to modulate the activity or expression of a cancer related molecule, further fractionation of the positive lead extract may be performed to isolate the chemical constituent responsible for the observed effect using methods known in the art.

Exemplary Assays and Animal Models for the Testing of Therapies

If desired, one or more of the treatments disclosed herein can be tested for their effect on a disease or disorder such as cancer using a cell line (such as a cell line with one or more of the mutations identified in the subject who has been diagnosed with cancer or an increased risk of cancer using the methods of the invention) or an animal model of the disease or disorder, such as a SCID mouse model (Jain et al., Tumor Models In Cancer Research, ed. Teicher, Humana Press Inc., Totowa, N.J., pp. 647-671, 2001, which is hereby incorporated by reference in its entirety). Additionally, there are numerous standard assays and animal models that can be used to determine the efficacy of particular therapies for stabilizing, treating, or preventing a disease or disorder such as cancer or an increased risk for a disease or disorder such as cancer. Therapies can also be tested in standard human clinical trials.

For the selection of a preferred therapy for a particular subject, compounds can be tested for their effect on the expression or activity of one or more genes that are mutated in the subject. For example, the ability of a compound to modulate the expression of particular mRNA molecules or proteins can be detected using standard Northern, Western, or microarray analysis. In some embodiments, one or more compounds are selected that (i) inhibit the expression or activity of mRNA molecules or proteins that promote cancer that are expressed at a higher than normal level or have a higher than normal level of activity in the subject (such as in a sample from the subject) or (ii) promote the expression or activity of mRNA molecules or proteins that inhibit cancer that are expressed at a lower than normal level or have a lower than normal level of activity in the subject. In some embodiments, an individual or combination therapy is selected that (i) modulates the greatest number of mRNA molecules or proteins that have mutations associated with cancer in the subject and (ii) modulates the least number of mRNA molecules or proteins that do not have mutations associated with cancer in the subject. In some embodiments, the selected individual or combination therapy has high drug efficacy and produces few, if any, adverse side-effects.
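
The two selection criteria above (modulating the greatest number of mutated targets while modulating the fewest non-mutated targets) can be illustrated with a simple ranking. The Python sketch below is one possible reading only: the function name and the tie-breaking order (criterion (i) first, then criterion (ii), then name) are assumptions made for illustration, and the therapy and gene labels are placeholders.

```python
def rank_therapies(therapy_targets, mutated_genes):
    """Rank candidate therapies for a subject.

    therapy_targets: mapping of therapy name -> set of genes whose mRNA or
                     protein the therapy modulates.
    mutated_genes:   genes carrying cancer-associated mutations in the subject.

    Ordering (an assumption of this sketch): most mutated targets modulated
    first, then fewest non-mutated targets modulated, then name for stability.
    """
    mutated = set(mutated_genes)

    def sort_key(item):
        name, targets = item
        targets = set(targets)
        on_target = len(targets & mutated)    # criterion (i)
        off_target = len(targets - mutated)   # criterion (ii)
        return (-on_target, off_target, name)

    return [name for name, _ in sorted(therapy_targets.items(), key=sort_key)]


# Placeholder candidates and subject mutations (hypothetical labels).
candidates = {
    "therapy_A": {"GENE1", "GENE2"},
    "therapy_B": {"GENE1", "GENE2", "GENE3"},
    "therapy_C": {"GENE1"},
}
subject_mutations = {"GENE1", "GENE2"}

ranking = rank_therapies(candidates, subject_mutations)
```

Efficacy and side-effect data, which the text also mentions, are not modeled here and would need to be weighed separately.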

As an alternative to the subject-specific analysis described above, DNA chips can be used to compare the expression of mRNA molecules in a particular type of early or late-stage cancer (e.g., breast cancer cells) to the expression in normal tissue (Marrack et al., Current Opinion in Immunology 12, 206-209, 2000; Harkin, Oncologist. 5:501-507, 2000; Pelizzari et al., Nucleic Acids Res. 28(22):4577-4581, 2000, which are each hereby incorporated by reference in its entirety). Based on this analysis, an individual or combination therapy for subjects with this type of cancer can be selected to modulate the expression of the mRNA or proteins that have altered expression in this type of cancer.

In addition to being used to select a therapy for a particular subject or group of subjects, expression profiling can be used to monitor the changes in mRNA and/or protein expression that occur during treatment. For example, expression profiling can be used to determine whether the expression of cancer related genes has returned to normal levels. If not, the dose of one or more compounds in the therapy can be altered to either increase or decrease the effect of the therapy on the expression levels of the corresponding cancer related gene(s). In addition, this analysis can be used to determine whether a therapy affects the expression of other genes (e.g., genes that are associated with adverse side-effects). If desired, the dose or composition of the therapy can be altered to prevent or reduce undesired side-effects.

Exemplary Formulations and Methods of Administration

For stabilizing, treating, or preventing a disease or disorder such as cancer or an increased risk for a disease or disorder such as cancer, a composition may be formulated and administered using any method known to those of skill in the art (see, e.g., U.S. Pat. Nos. 8,389,578 and 8,389,557, which are each hereby incorporated by reference in its entirety). General techniques for formulation and administration are found in “Remington: The Science and Practice of Pharmacy,” 21st Edition, Ed. David Troy, 2006, Lippincott Williams & Wilkins, Philadelphia, Pa., which is hereby incorporated by reference in its entirety. Liquids, slurries, tablets, capsules, pills, powders, granules, gels, ointments, suppositories, injections, inhalants, and aerosols are examples of such formulations. By way of example, modified or extended release oral formulations can be prepared using additional methods known in the art. For example, a suitable extended release form of an active ingredient may be a matrix tablet or capsule composition. Suitable matrix forming materials include, for example, waxes (e.g., carnauba, bees wax, paraffin wax, ceresin, shellac wax, fatty acids, and fatty alcohols), oils, hardened oils or fats (e.g., hardened rapeseed oil, castor oil, beef tallow, palm oil, and soya bean oil), and polymers (e.g., hydroxypropyl cellulose, polyvinylpyrrolidone, hydroxypropyl methyl cellulose, and polyethylene glycol). Other suitable matrix tableting materials are microcrystalline cellulose, powdered cellulose, hydroxypropyl cellulose, ethyl cellulose, with other carriers, and fillers. Tablets may also contain granulates, coated powders, or pellets. Tablets may also be multi-layered. Optionally, the finished tablet may be coated or uncoated.

Typical routes of administering such compositions include, without limitation, oral, sublingual, buccal, topical, transdermal, inhalation, parenteral (e.g., subcutaneous, intravenous, intramuscular, intrasternal injection, or infusion techniques), rectal, vaginal, and intranasal. In preferred embodiments, the therapy is administered using an extended release device. Compositions of the invention are formulated so as to allow the active ingredient(s) contained therein to be bioavailable upon administration of the composition. Compositions may take the form of one or more dosage units. Compositions may contain 1, 2, 3, 4, or more active ingredients and may optionally contain 1, 2, 3, 4, or more inactive ingredients.

Alternate Embodiments

Any of the methods described herein may include the output of data in a physical format, such as on a computer screen, or on a paper printout. Any of the methods of the invention may be combined with the output of the actionable data in a format that can be acted upon by a physician. Some of the embodiments described in the document for determining genetic data pertaining to a target individual may be combined with the notification of a potential chromosomal abnormality (such as a deletion or duplication), or lack thereof, with a medical professional, optionally combined with the decision to abort, or to not abort, a fetus in the context of prenatal diagnosis. Some of the embodiments described herein may be combined with the output of the actionable data, and the execution of a clinical decision that results in a clinical treatment, or the execution of a clinical decision to make no action.

In some embodiments, a method is disclosed herein for generating a report disclosing a result of any method of the invention (such as the presence or absence of a deletion or duplication). A report may be generated with a result from a method of the invention, and it may be sent to a physician electronically, displayed on an output device (such as a digital report), or a written report (such as a printed hard copy of the report) may be delivered to the physician. In addition, the described methods may be combined with the actual execution of a clinical decision that results in a clinical treatment, or the execution of a clinical decision to make no action.

In certain embodiments, the present invention provides reagents, kits, and methods, and computer systems and computer media with encoded instructions for performing such methods, for detecting both CNVs and SNVs from the same sample using the multiplex PCR methods disclosed herein. In certain preferred embodiments the sample is a single cell sample or a plasma sample suspected of containing circulating tumor DNA. These embodiments take advantage of the discovery that by interrogating DNA samples from single cells or plasma for CNVs and SNVs using the highly sensitive multiplex PCR methods disclosed herein, improved cancer detection can be achieved, versus interrogating for either CNVs or SNVs alone, especially for cancers exhibiting CNV such as breast, ovarian, and lung cancer. The methods in certain illustrative embodiments for analyzing CNVs interrogate for between 50 and 100,000, or 50 and 10,000, or 50 and 1,000 SNPs, and for SNVs interrogate for between 50 and 1000 SNVs, or for between 50 and 500 SNVs, or for between 50 and 250 SNVs. The methods provided herein for detecting CNVs and/or SNVs in plasma of subjects suspected of having cancer, including for example, cancers known to exhibit CNVs and SNVs, such as breast, lung, and ovarian cancer, provide the advantage of detecting CNVs and/or SNVs from tumors that often are composed of heterogeneous cancer cell populations in terms of genetic compositions. Thus, traditional methods, which focus on analyzing only certain regions of the tumors, can often miss CNVs or SNVs that are present in cells in other regions of the tumor. The plasma samples act as liquid biopsies that can be interrogated to detect any of the CNVs and/or SNVs that are present in only subpopulations of tumor cells.

Example Computer Architecture

FIG. 69 shows an example system architecture X00 useful for performing embodiments of the present invention. System architecture X00 includes an analysis platform X08 connected to one or more laboratory information systems (“LISs”) X04. As shown in FIG. 69, analysis platform X08 may be connected to LIS X04 over a network X02. Network X02 may include one or more networks of one or more network types, including any combination of LAN, WAN, the Internet, etc. Network X02 may encompass connections between any or all components in system architecture X00. Analysis platform X08 may alternatively or additionally be connected directly to LIS X06. In an embodiment, analysis platform X08 analyzes genetic data provided by LIS X04 in a software-as-a-service model, where LIS X04 is a third-party LIS, while analysis platform X08 analyzes genetic data provided by LIS X06 in a full-service or in-house model, where LIS X06 and analysis platform X08 are controlled by the same party. In an embodiment where analysis platform X08 is providing information over network X02, analysis platform X08 may be a server.

In an example embodiment, laboratory information system X04 includes one or more public or private institutions that collect, manage, and/or store genetic data. A person having skill in the relevant art(s) would understand that methods and standards for securing genetic data are known and can be implemented using various information security techniques and policies, e.g., username/password, Transport Layer Security (TLS), Secure Sockets Layer (SSL), and/or other cryptographic protocols providing communication security.

In an example embodiment, system architecture X00 operates as a service-oriented architecture and uses a client-server model that would be understood by one of skill in the relevant art(s) to enable various forms of interaction and communication between LIS X04 and analysis platform X08. System architecture X00 may be distributed over various types of networks X02 and/or may operate as cloud computing architecture. Cloud computing architecture may include any type of distributed network architecture. By way of example and not of limitation, cloud computing architecture is useful for providing software as a service (SaaS), infrastructure as a service (IaaS), platform as a service (PaaS), network as a service (NaaS), data as a service (DaaS), database as a service (DBaaS), backend as a service (BaaS), test environment as a service (TEaaS), API as a service (APIaaS), integration platform as a service (IPaaS), etc.

In an example embodiment, LISs X04 and X06 each include a computer, device, interface, etc. or any sub-system thereof. LISs X04 and X06 may include an operating system (OS) and applications installed to perform various functions such as, for example, access to and/or navigation of data made accessible locally, in memory, and/or over network X02. In an embodiment, LIS X04 accesses analysis platform X08 through an application programming interface (“API”). LIS X04 may also include one or more native applications that may operate independently of an API.

In an example embodiment, analysis platform X08 includes one or more of an input processor X12, a hypothesis manager X14, a modeler X16, an error correction unit X18, a machine learning unit X20, and an output processor X22. Input processor X12 receives and processes inputs from LISs X04 and/or X06. Processing may include but is not limited to operations such as parsing, transcoding, translating, adapting, or otherwise handling any input received from LISs X04 and/or X06. Inputs may be received via one or more streams, feeds, databases, or other sources of data, such as may be made accessible by LISs X04 and X06. Data errors may be corrected by error correction unit X18 through performance of the error correction mechanisms described above.

In an example embodiment, hypothesis manager X14 is configured to receive the inputs passed from input processor X12 in a form ready to be processed in accordance with hypotheses for genetic analysis that are represented as models and/or algorithms. Such models and/or algorithms may be used by modeler X16 to generate probabilities, for example, based on dynamic, real-time, and/or historical statistics or other indicators. Data used to derive and populate such strategy models and/or algorithms are available to hypothesis manager X14 via, for example, genetic data source X10. Genetic data source X10 may include, for example, a nucleic acid sequencer. Hypothesis manager X14 may be configured to formulate hypotheses based on, for example, the variables required to populate its models and/or algorithms. Models and/or algorithms, once populated, may be used by modeler X16 to generate one or more hypotheses as described above. Hypothesis manager X14 may select a particular value, range of values, or estimate based on a most-likely hypothesis as an output as described above. Modeler X16 may operate in accordance with models and/or algorithms trained by machine learning unit X20. For example, machine learning unit X20 may develop such models and/or algorithms by applying a classification algorithm as described above to a training set database (not shown). In certain embodiments, the machine learning unit analyzes one or more control samples to generate training data sets useful in the SNV detection methods provided herein.

Once hypothesis manager X14 has identified a particular output, such output may be returned by output processor X22 to the particular LIS X04 or X06 requesting the information.

Various aspects of the disclosure can be implemented on a computing device by software, firmware, hardware, or a combination thereof. FIG. 70 illustrates an example computer system Y00 in which the contemplated embodiments, or portions thereof, can be implemented as computer-readable code. Various embodiments are described in terms of this example computer system Y00.

Processing tasks in the embodiment of FIG. 70 are carried out by one or more processors Y02. However, it should be noted that various types of processing technology may be used here, including programmable logic arrays (PLAs), application-specific integrated circuits (ASICs), multi-core processors, multiple processors, or distributed processors. Additional specialized processing resources such as graphics, multimedia, or mathematical processing capabilities may also be used to aid in certain processing tasks. These processing resources may be hardware, software, or an appropriate combination thereof. For example, one or more of processors Y02 may be a graphics-processing unit (GPU). In an embodiment, a GPU is a processor that is a specialized electronic circuit designed to rapidly process mathematically intensive applications on electronic devices. The GPU may have a highly parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data. Alternatively or in addition, one or more of processors Y02 may be a special-purpose parallel processor without the graphics optimization, such parallel processors performing the mathematically intensive functions described herein. One or more of processors Y02 may include a processing accelerator (e.g., DSP or other special-purpose processor).

Computer system Y00 also includes a main memory Y30, and may also include a secondary memory Y40. Main memory Y30 may be a volatile memory or non-volatile memory, and may be divided into channels. Secondary memory Y40 may include, for example, non-volatile memory such as a hard disk drive Y50, a removable storage drive Y60, and/or a memory stick. Removable storage drive Y60 may comprise a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash memory, or the like. The removable storage drive Y60 reads from and/or writes to a removable storage unit Y70 in a well-known manner. Removable storage unit Y70 may comprise a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive Y60. As will be appreciated by persons skilled in the relevant art(s), removable storage unit Y70 includes a computer usable storage medium having stored therein computer software and/or data.

In alternative implementations, secondary memory Y40 may include other similar means for allowing computer programs or other instructions to be loaded into computer system Y00. Such means may include, for example, a removable storage unit Y70 and an interface (not shown). Examples of such means may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units Y70 and interfaces which allow software and data to be transferred from the removable storage unit Y70 to computer system Y00.

Computer system Y00 may also include a memory controller Y75. Memory controller Y75 controls data access to main memory Y30 and secondary memory Y40. In some embodiments, memory controller Y75 may be external to processor Y10, as shown in FIG. 70. In other embodiments, memory controller Y75 may also be directly part of processor Y10. For example, many AMD™ and Intel™ processors use integrated memory controllers that are part of the same chip as processor Y10 (not shown in FIG. 70).

Computer system Y00 may also include a communications and network interface Y80. Communications and network interface Y80 allows software and data to be transferred between computer system Y00 and external devices. Communications and network interface Y80 may include a modem, a communications port, a PCMCIA slot and card, or the like. Software and data transferred via communications and network interface Y80 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by communications and network interface Y80. These signals are provided to communications and network interface Y80 via a communication path Y85. Communication path Y85 carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link or other communications channels.

The communications and network interface Y80 allows the computer system Y00 to communicate over communication networks or mediums such as LANs, WANs, the Internet, etc. The communications and network interface Y80 may interface with remote sites or networks via wired or wireless connections.

In this document, the terms “computer program medium,” “computer-usable medium” and “non-transitory medium” are used to generally refer to tangible media such as removable storage unit Y70, removable storage drive Y60, and a hard disk installed in hard disk drive Y50. Signals carried over communication path Y85 can also embody the logic described herein. Computer program medium and computer usable medium can also refer to memories, such as main memory Y30 and secondary memory Y40, which can be memory semiconductors (e.g. DRAMs, etc.). These computer program products are means for providing software to computer system Y00.

Computer programs (also called computer control logic) are stored in main memory Y30 and/or secondary memory Y40. Computer programs may also be received via communications and network interface Y80. Such computer programs, when executed, enable computer system Y00 to implement embodiments as discussed herein. In particular, the computer programs, when executed, enable processor Y10 to implement the disclosed processes. Accordingly, such computer programs represent controllers of the computer system Y00. Where the embodiments are implemented using software, the software may be stored in a computer program product and loaded into computer system Y00 using removable storage drive Y60, interfaces, hard drive Y50 or communications and network interface Y80, for example.

The computer system Y00 may also include input/output/display devices Y90, such as keyboards, monitors, pointing devices, touchscreens, etc.

It should be noted that the simulation, synthesis and/or manufacture of various embodiments may be accomplished, in part, through the use of computer readable code, including general programming languages (such as C or C++), hardware description languages (HDL) such as, for example, Verilog HDL, VHDL, Altera HDL (AHDL), or other available programming tools. This computer readable code can be disposed in any known computer-usable medium including a semiconductor, magnetic disk, or optical disk (such as CD-ROM, DVD-ROM). As such, the code can be transmitted over communication networks including the Internet.

The embodiments are also directed to computer program products comprising software stored on any computer-usable medium. Such software, when executed in one or more data processing devices, causes a data processing device(s) to operate as described herein. Embodiments employ any computer-usable or -readable medium, and any computer-usable or -readable storage medium known now or in the future. Examples of computer-usable or computer-readable mediums include, but are not limited to, primary storage devices (e.g., any type of random access memory), secondary storage devices (e.g., hard drives, floppy disks, CD ROMS, ZIP disks, tapes, magnetic storage devices, optical storage devices, MEMS, nano-technological storage devices, etc.), and communication mediums (e.g., wired and wireless communications networks, local area networks, wide area networks, intranets, etc.). Computer-usable or computer-readable mediums can include any form of transitory (which include signals) or non-transitory media (which exclude signals). Non-transitory media comprise, by way of non-limiting example, the aforementioned physical storage devices (e.g., primary and secondary storage devices).

It will be understood that any of the embodiments disclosed herein can be used in combination with any other embodiment disclosed herein.

Experimental Section

The presently disclosed embodiments are described in the following Examples, which are set forth to aid in the understanding of the disclosure, and should not be construed to limit in any way the scope of the disclosure as defined in the claims which follow thereafter. The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how to use the described embodiments, and are not intended to limit the scope of the disclosure, nor are they intended to represent that the experiments below are all or the only experiments performed. Efforts have been made to ensure accuracy with respect to numbers used (e.g., amounts, temperature, etc.) but some experimental errors and deviations should be accounted for. Unless indicated otherwise, parts are parts by volume, and temperature is in degrees Centigrade. It should be understood that variations in the methods as described may be made without changing the fundamental aspects that the experiments are meant to illustrate.

Example 1

Exemplary sample preparation and amplification methods are described in U.S. application Ser. No. 13/683,604, filed Nov. 21, 2012 (U.S. Publication No. 2013/0123120), and U.S. Ser. No. 61/994,791, filed May 16, 2014, each of which is hereby incorporated by reference in its entirety. These methods can be used for analysis of any of the samples disclosed herein.

In one experiment, plasma samples were prepared and amplified using a hemi-nested 19,488-plex protocol. The samples were prepared in the following way: up to 20 mL of blood were centrifuged to isolate the buffy coat and the plasma. The genomic DNA in the blood sample was prepared from the buffy coat. Genomic DNA can also be prepared from a saliva sample. Cell-free DNA in the plasma was isolated using the QIAGEN CIRCULATING NUCLEIC ACID kit and eluted in 50 uL TE buffer according to manufacturer's instructions. Universal ligation adapters were appended to the end of each molecule of 40 uL of purified plasma DNA and libraries were amplified for 9 cycles using adaptor specific primers. Libraries were purified with AGENCOURT AMPURE beads and eluted in 50 uL DNA suspension buffer.

6 uL of the DNA was amplified with 15 cycles of STAR 1 (95° C. for 10 min for initial polymerase activation, then 15 cycles of 96° C. for 30 s; 65° C. for 1 min; 58° C. for 6 min; 60° C. for 8 min; 65° C. for 4 min and 72° C. for 30 s; and a final extension at 72° C. for 2 min) using 7.5 nM primer concentration of 19,488 target-specific tagged reverse primers and one library adaptor specific forward primer at 500 nM.

The hemi-nested PCR protocol involved a second amplification of a dilution of the STAR 1 product for 15 cycles (STAR 2) (95° C. for 10 min for initial polymerase activation, then 15 cycles of 95° C. for 30 s; 65° C. for 1 min; 60° C. for 5 min; 65° C. for 5 min and 72° C. for 30 s; and a final extension at 72° C. for 2 min) using reverse tag concentration of 1000 nM, and a concentration of 20 nM for each of 19,488 target-specific forward primers.
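As a bookkeeping aid, the two thermocycling programs above can be written down as data and their total programmed hold times summed. This is only a sketch: the dictionary layout and the function name `total_hold_minutes` are illustrative assumptions, the step durations are copied from the STAR 1 and STAR 2 descriptions, and instrument ramp times between temperatures are ignored.

```python
# Step durations (minutes) copied from the STAR 1 / STAR 2 descriptions.
# The dict layout is a hypothetical convenience, not part of the protocol.
STAR1 = {
    "activation_min": 10,
    "cycles": 15,
    "steps_min": [0.5, 1, 6, 8, 4, 0.5],  # 96C, 65C, 58C, 60C, 65C, 72C
    "final_extension_min": 2,
}
STAR2 = {
    "activation_min": 10,
    "cycles": 15,
    "steps_min": [0.5, 1, 5, 5, 0.5],  # 95C, 65C, 60C, 65C, 72C
    "final_extension_min": 2,
}


def total_hold_minutes(program):
    """Sum of programmed hold times; ramping between temperatures is ignored."""
    per_cycle = sum(program["steps_min"])
    return (program["activation_min"]
            + program["cycles"] * per_cycle
            + program["final_extension_min"])


star1_minutes = total_hold_minutes(STAR1)
star2_minutes = total_hold_minutes(STAR2)
```

Under these assumptions each STAR 1 cycle holds for 20 minutes and each STAR 2 cycle for 12 minutes, so the programmed hold time is dominated by the long annealing and extension steps rather than by activation or final extension.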

An aliquot of the STAR 2 products was then amplified by standard PCR for 12 cycles with 1 uM of tag-specific forward and barcoded reverse primers to generate barcoded sequencing libraries. An aliquot of each library was mixed with libraries of different barcodes and purified using a spin column.

In this way, 19,488 primers were used in the single-well reactions; the primers were designed to target SNPs found on chromosomes 1, 2, 13, 18, 21, X and Y. The amplicons were then sequenced using an ILLUMINA GAIIX sequencer. If desired, the number of sequencing reads can be increased to increase the number of targeted SNPs that are amplified and sequenced.

Relevant genomic DNA samples were amplified using a semi-nested 19,488-plex protocol with outer forward primers and tagged reverse primers at 7.5 nM in STAR 1. Thermocycling conditions and composition of STAR 2 and of the barcoding PCR were the same as for the hemi-nested protocol.

Example 2

Exemplary primer selection methods are described in U.S. application Ser. No. 13/683,604, filed Nov. 21, 2012 (U.S. Publication No. 2013/0123120) and U.S. Ser. No. 61/994,791, filed May 16, 2014, each of which is hereby incorporated by reference in its entirety. These methods can be used for analysis of any of the samples disclosed herein.

The following experiment illustrates an exemplary method for designing and selecting a library of primers that can be used in any of the multiplexed PCR methods of the invention. The goal is to select primers from an initial library of candidate primers that can be used to simultaneously amplify a large number of target loci (or a subset of target loci) in a single reaction volume. For an initial set of candidate target loci, primers did not have to be designed or selected for each target locus. Preferably, primers are designed and selected for a large portion of the most desirable target loci.

Step 1

A set of candidate target loci (such as SNPs) were selected based on publicly available information about desired parameters for the target loci, such as frequency of the SNPs within a target population or heterozygosity rate of the SNPs (worldwide web at ncbi.nlm.nih.gov/projects/SNP/; Sherry S T, Ward M H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan. 1; 29(1):308-11, which are each incorporated by reference in its entirety). For each candidate locus, one or more PCR primer pairs were designed using the Primer3 program (the worldwide web at primer3.sourceforge.net; libprimer3 release 2.2.3, which is hereby incorporated by reference in its entirety). If there were no feasible designs for PCR primers for a particular target locus, then that target locus was eliminated from further consideration.

If desired, a “target locus score” (higher score representing higher desirability) can be calculated for most or all of the target loci, such as a target locus score calculated based on a weighted average of various desired parameters for the target loci. The parameters may be assigned different weights based on their importance for the particular application that the primers will be used for. Exemplary parameters include the heterozygosity rate of the target locus, the disease prevalence associated with a sequence (e.g., a polymorphism) at the target locus, the disease penetrance associated with a sequence (e.g., a polymorphism) at the target locus, the specificity of the candidate primer(s) used to amplify the target locus, the size of the candidate primer(s) used to amplify the target locus, and the size of the target amplicon. In some embodiments, the specificity of the candidate primer for the target locus includes the likelihood that the candidate primer will mis-prime by binding and amplifying a locus other than the target locus it was designed to amplify. In some embodiments, one or more or all the candidate primers that mis-prime are removed from the library.
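A weighted-average target locus score of the kind described above might be sketched as follows. The parameter names, the example weights, and the assumption that each parameter has already been normalized so that a higher value is more desirable are all illustrative; the text does not specify them.

```python
def target_locus_score(params, weights):
    """Weighted average of parameter values for one locus.

    Assumes (hypothetically) that every parameter is pre-normalized so that
    a higher value means a more desirable locus.
    """
    total_weight = sum(weights[name] for name in params)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(params[name] * weights[name] for name in params)
    return weighted_sum / total_weight


# Hypothetical normalized parameters and application-specific weights.
params = {"heterozygosity": 0.5, "primer_specificity": 0.9, "amplicon_size": 0.7}
weights = {"heterozygosity": 2.0, "primer_specificity": 1.0, "amplicon_size": 1.0}
score = target_locus_score(params, weights)
```

Here heterozygosity is weighted twice as heavily as the other two parameters, which illustrates how the weights can reflect the needs of a particular application.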

Step 2

A thermodynamic interaction score was calculated between each primer and all primers for all other target loci from Step 1 (see, e.g., Allawi, H. T. & SantaLucia, J., Jr. (1998), “Thermodynamics of Internal C-T Mismatches in DNA”, Nucleic Acids Res. 26, 2694-2701; Peyret, N., Seneviratne, P. A., Allawi, H. T. & SantaLucia, J., Jr. (1999), “Nearest-Neighbor Thermodynamics and NMR of DNA Sequences with Internal A-A, C-C, G-G, and T-T Mismatches”, Biochemistry 38, 3468-3477; Allawi, H. T. & SantaLucia, J., Jr. (1998), “Nearest-Neighbor Thermodynamics of Internal A-C Mismatches in DNA: Sequence Dependence and pH Effects”, Biochemistry 37, 9435-9444; Allawi, H. T. & SantaLucia, J., Jr. (1998), “Nearest Neighbor Thermodynamic Parameters for Internal G-A Mismatches in DNA”, Biochemistry 37, 2170-2179; Allawi, H. T. & SantaLucia, J., Jr. (1997), “Thermodynamics and NMR of Internal G-T Mismatches in DNA”, Biochemistry 36, 10581-10594; and MultiPLX 2.1 (Kaplinski L, Andreson R, Puurand T, Remm M. MultiPLX: automatic grouping and evaluation of PCR primers. Bioinformatics. 2005 Apr. 15; 21(8):1701-2), which are each hereby incorporated by reference in its entirety). This step resulted in a 2D matrix of interaction scores. The interaction score predicted the likelihood of primer-dimers involving the two interacting primers. The score was calculated as follows:

interaction score=max(−deltaG_2, 0.8*(−deltaG_1))

where

deltaG_2=Gibbs energy (energy required to break the dimer) for a dimer that is extensible by PCR on both ends, i.e., the 3′ end of each primer anneals to the other primer; and

deltaG_1=Gibbs energy for a dimer that is extensible by PCR on at least one end.
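The formula above translates directly into a small function. This sketch assumes the usual sign convention in which a more stable dimer has a more negative Gibbs energy, so that negating deltaG gives a larger score for a more problematic pair; the function name and the numeric values in the example are hypothetical.

```python
def interaction_score(delta_g_2, delta_g_1):
    """interaction score = max(-deltaG_2, 0.8 * (-deltaG_1)).

    delta_g_2: Gibbs energy of the dimer extensible on both ends.
    delta_g_1: Gibbs energy of the dimer extensible on at least one end.
    Both inputs are assumed to use the same energy units.
    """
    return max(-delta_g_2, 0.8 * (-delta_g_1))


# Hypothetical energies: the one-end-extensible dimer is more stable, and
# even after the 0.8 discount it determines the score.
example_score = interaction_score(delta_g_2=-3.0, delta_g_1=-5.0)
```

The 0.8 factor down-weights dimers that are extensible on only one end relative to dimers that are extensible on both ends.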

Step 3

For each target locus, if there was more than one primer-pair design, then one design was selected using the following method:

1. For each primer-pair design for the locus, find the worst-case (highest) interaction score for the two primers in that design and all primers from all designs for all other target loci.

2. Pick the design with the best (lowest) worst-case interaction score.
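The two-part selection rule of Step 3 can be sketched as a min-over-max computation. The data layout (a dictionary from locus name to a list of primer-pair tuples), the primer and locus names, and the lookup-table form of the `score` callable are assumptions made for illustration.

```python
def pick_design(locus, designs_by_locus, score):
    """Return the design for `locus` with the lowest worst-case interaction score."""
    # All primers from all designs of every other locus.
    other_primers = [
        primer
        for other, designs in designs_by_locus.items() if other != locus
        for design in designs
        for primer in design
    ]

    def worst_case(design):
        # Highest score between either primer of this design and any other primer.
        return max((score(p, q) for p in design for q in other_primers), default=0.0)

    return min(designs_by_locus[locus], key=worst_case)


# Hypothetical primers and pairwise scores (unlisted pairs score 0.0).
designs_by_locus = {
    "L1": [("a1", "a2"), ("b1", "b2")],
    "L2": [("c1", "c2")],
}
table = {frozenset(("a1", "c1")): 5.0, frozenset(("b2", "c2")): 2.0}


def score(p, q):
    return table.get(frozenset((p, q)), 0.0)


chosen = pick_design("L1", designs_by_locus, score)
```

In this toy input the first design for L1 has a worst-case score of 5.0 and the second a worst-case score of 2.0, so the second design is kept.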

Step 4

A graph was built such that each node represented one locus and its associated primer-pair design (e.g., a Maximal Clique problem). One edge was created between every pair of nodes. A weight was assigned to each edge equal to the worst-case (highest) interaction score between the primers associated with the two nodes connected by the edge.

Step 5

If desired, for every pair of designs for two different target loci where one of the primers from one design and one of the primers from the other design would anneal to overlapping target regions, an additional edge was added between the nodes for the two designs. The weight of these edges was set equal to the highest weight assigned in Step 4. Thus, Step 5 prevents the library from having primers that would anneal to overlapping target regions, and thus interfere with each other during a multiplex PCR reaction.
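Steps 4 and 5 can be sketched as building a complete weighted graph stored as a dictionary of edges. This sketch makes one interpretive assumption: because Step 4 already creates an edge between every pair of nodes, the "additional edge" of Step 5 is modeled by raising the weight of the existing edge to the maximum Step 4 weight. The function and variable names are hypothetical.

```python
from itertools import combinations


def build_graph(design_by_locus, score, overlapping_pairs=()):
    """Return {frozenset({u, v}): weight} for every pair of loci."""
    edges = {}
    # Step 4: edge weight is the worst-case score between the two designs.
    for u, v in combinations(sorted(design_by_locus), 2):
        edges[frozenset((u, v))] = max(
            score(p, q) for p in design_by_locus[u] for q in design_by_locus[v]
        )
    # Step 5 (one reading): overlapping designs get the highest Step 4 weight.
    if edges:
        top = max(edges.values())
        for u, v in overlapping_pairs:
            edges[frozenset((u, v))] = top
    return edges


# Hypothetical designs and pairwise scores (unlisted pairs score 0.0).
designs = {"L1": ("a1", "a2"), "L2": ("b1", "b2"), "L3": ("c1", "c2")}
table = {
    frozenset(("a1", "b1")): 4.0,
    frozenset(("a2", "c1")): 1.0,
    frozenset(("b2", "c2")): 2.0,
}


def score(p, q):
    return table.get(frozenset((p, q)), 0.0)


edges = build_graph(designs, score, overlapping_pairs=[("L1", "L3")])
```

Raising the overlap edges to the maximum weight means they survive any threshold that keeps at least one edge, so overlapping designs are never both retained unless all edges are ignored.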

Step 6

An initial interaction score threshold was calculated as follows:

weight_threshold=max(edge_weight)−0.05*(max(edge_weight)−min(edge_weight))

where

max(edge_weight) is the maximum edge weight in the graph; and

min(edge_weight) is the minimum edge weight in the graph.

The initial bounds for the threshold were set as follows:

max_weight_threshold=max(edge_weight)

min_weight_threshold=min(edge_weight)
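The initial threshold and its bounds can be computed directly from the list of edge weights. The function name and the example weights below are hypothetical; the arithmetic follows the formulas of Step 6.

```python
def initial_threshold(edge_weights):
    """Return (weight_threshold, max_weight_threshold, min_weight_threshold)."""
    max_w = max(edge_weights)
    min_w = min(edge_weights)
    # Start 5% of the weight range below the maximum edge weight.
    weight_threshold = max_w - 0.05 * (max_w - min_w)
    return weight_threshold, max_w, min_w


# Hypothetical edge weights.
threshold, upper, lower = initial_threshold([1.0, 2.0, 5.0])
```

Starting near the top of the range means that only the strongest predicted interactions are treated as conflicts on the first pass.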

Step 7

A new graph was constructed consisting of the same set of nodes as the graph from Step 5, only including edges with weights that exceed weight_threshold. Thus, this step ignores interactions with scores equal to or below weight_threshold.

Step 8

Nodes (and all of the edges connected to the removed nodes) were removed from the graph of Step 7 until there were no edges left. Nodes were removed by applying the following procedure repeatedly:

1. Find the node with the highest degree (highest number of edges). If there is more than one then pick one arbitrarily.

2. Define the set of nodes consisting of the node picked above and all of the nodes connected to it, but excluding any nodes that have degree less than the node picked above.

3. Choose the node from the set that has the lowest target locus score (lower score representing lower desirability) from Step 1. Remove that node from the graph.
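The removal procedure of Step 8 can be sketched with an adjacency-set representation. The function name, the input layout (an iterable of edge pairs that already passed the Step 7 threshold), and the example scores are assumptions; ties for the highest degree are broken by iteration order, which is one way of picking arbitrarily.

```python
def prune_nodes(nodes, edges, locus_score):
    """Remove nodes until no edges remain; return the set of surviving nodes."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    while any(adjacency.values()):
        # 1. Node with the highest degree (first one found on ties).
        hub = max(adjacency, key=lambda n: len(adjacency[n]))
        hub_degree = len(adjacency[hub])
        # 2. The hub plus its neighbors, excluding neighbors of lower degree.
        candidates = [hub] + [
            n for n in adjacency[hub] if len(adjacency[n]) >= hub_degree
        ]
        # 3. Drop the least desirable candidate and all of its edges.
        victim = min(candidates, key=lambda n: locus_score[n])
        for neighbor in adjacency.pop(victim):
            adjacency[neighbor].discard(victim)
    return set(adjacency)


# Hypothetical star graph: A conflicts with B, C and D.
star = prune_nodes(
    ["A", "B", "C", "D"],
    [("A", "B"), ("A", "C"), ("A", "D")],
    {"A": 0.9, "B": 0.5, "C": 0.6, "D": 0.7},
)
# Hypothetical single conflict between equally connected nodes A and B.
pair = prune_nodes(["A", "B", "C"], [("A", "B")], {"A": 0.9, "B": 0.2, "C": 0.5})
```

In the star example the hub A is removed even though it has the highest target locus score, because no neighbor matches its degree; in the pair example the degrees tie, so the lower-scoring node B is removed instead.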

Step 9

If the number of nodes remaining in the graph satisfies the required number of target loci for the multiplexed PCR pool (within an acceptable tolerance), then the method was continued at Step 10.

If there were too many or too few nodes remaining in the graph, then a binary search was performed to determine what threshold values would result in the desired number of nodes remaining in the graph. If there were too many nodes in the graph, then the weight_threshold bounds were adjusted as follows:

max_weight_threshold=weight_threshold

Otherwise (if there were too few nodes in the graph), the weight_threshold bounds were adjusted as follows:

min_weight_threshold=weight_threshold

Then, the weight_threshold was adjusted as follows:

weight_threshold=(max_weight_threshold+min_weight_threshold)/2

Steps 7-9 were repeated.
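The loop formed by Steps 7-9 is a bisection on weight_threshold. In the sketch below, `count_remaining` is a hypothetical callable standing in for Steps 7 and 8 (filter edges by the threshold, prune nodes, and count the survivors), and `max_iter` is an added safeguard that the text does not mention. The toy `count_remaining` used in the example is only a monotone stand-in, not a model of real primer data.

```python
def search_threshold(edge_weights, count_remaining, target, tolerance=0, max_iter=50):
    """Bisect weight_threshold until the surviving node count is near `target`."""
    max_bound = max(edge_weights)
    min_bound = min(edge_weights)
    threshold = max_bound - 0.05 * (max_bound - min_bound)  # Step 6
    for _ in range(max_iter):
        remaining = count_remaining(threshold)  # Steps 7 and 8
        if abs(remaining - target) <= tolerance:  # Step 9: go to Step 10
            return threshold, remaining
        if remaining > target:
            max_bound = threshold  # too many nodes: lower the upper bound
        else:
            min_bound = threshold  # too few nodes: raise the lower bound
        threshold = (max_bound + min_bound) / 2
    return threshold, count_remaining(threshold)


# Toy stand-in: a node "survives" when its conflict value is at or below the
# threshold, so a higher threshold keeps more nodes.
conflict_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


def count_remaining(threshold):
    return sum(value <= threshold for value in conflict_values)


final_threshold, final_count = search_threshold(conflict_values, count_remaining, target=4)
```

A higher threshold ignores more interactions, so fewer nodes are pruned and more remain; this is why the upper bound is lowered when too many nodes survive.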

Step 10

The primer-pair designs associated with the nodes remaining in the graph were selected for the library of primers. This primer library can be used in any of the methods of the invention.

If desired, this method of designing and selecting primers can be performed for primer libraries in which only one primer (instead of a primer pair) is used for amplification of a target locus. In this case, a node represents one primer per target locus (rather than a primer pair).

Example 3

If desired, methods of the invention can be tested to evaluate their ability to detect a deletion or duplication of a chromosome or chromosome segment. The following experiment was performed to demonstrate the detection of an overrepresentation of the X chromosome or a segment from the X chromosome inherited from the father compared to the X chromosome or X chromosome segment from the mother. This assay is designed to mimic a deletion or duplication of a chromosome or chromosome segment. Different amounts of DNA from a father (with XY sex chromosomes) were mixed with DNA from a daughter (with XX sex chromosomes) of the father for analysis of the extra amount of X chromosome from the father (FIGS. 19A-19D).

DNA from father and daughter cell lines was extracted and quantified using Qubit. Father cell line AG16782, cAG16782-2-F and daughter cell line AG16777, cAG16777-2-P were used. To determine the father's haplotype for the X chromosome, SNPs were detected that are present on the X chromosome but not on the Y chromosome, so there would be a signal from the father's X chromosome but not Y chromosome. The daughter inherited this haplotype from the father. The haplotype from the other X chromosome in the daughter was inherited from her mother. This haplotype from the mother can be determined by assigning the SNPs in the DNA from the daughter cell line that were not inherited from the father to the haplotype from the mother.
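The assignment of the maternal haplotype described above amounts to removing, at each SNP, one copy of the father's X allele from the daughter's genotype. The sketch below assumes error-free genotype calls; the SNP identifiers, the data layout, and the function name are hypothetical.

```python
def maternal_haplotype(daughter_genotypes, father_x_alleles):
    """Infer the maternally inherited X allele at each SNP.

    daughter_genotypes: {snp_id: (allele, allele)} for the daughter's two X copies.
    father_x_alleles:   {snp_id: allele} for the father's single X chromosome.
    """
    maternal = {}
    for snp, genotype in daughter_genotypes.items():
        alleles = list(genotype)
        paternal = father_x_alleles[snp]
        if paternal not in alleles:
            raise ValueError(f"{snp}: daughter lacks the father's X allele")
        alleles.remove(paternal)  # drop one copy of the paternal allele
        maternal[snp] = alleles[0]  # what remains came from the mother
    return maternal


# Hypothetical SNP identifiers and genotype calls.
daughter = {"snp_1": ("A", "G"), "snp_2": ("C", "C")}
father = {"snp_1": "A", "snp_2": "C"}
mother_hap = maternal_haplotype(daughter, father)
```

At homozygous SNPs such as snp_2 the maternal and paternal alleles are identical, so only heterozygous SNPs distinguish the two haplotypes.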

To determine whether an overrepresentation of the X chromosome from the father could be detected, different amounts of DNA from the father cell line were mixed with DNA from the daughter cell line. The total DNA input was approximately 75 ng (~25 k copies) of genomic DNA. Approximately 3,456 SNPs were amplified using direct multiplex PCR for X and Y chromosome assays. The amplified products were sequenced using 50 bp single run sequencing with 7 bp barcodes using the Rapid/HT mode. The number of reads was approximately 10K per SNP.
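To make the expected signal in such a mixture concrete, the idealized share of X chromosome copies carrying the father's haplotype can be worked out from the mixing proportion. This is a back-of-the-envelope sketch, not a description of the analysis actually performed: it assumes the mixing proportion is expressed as the fraction of cell-equivalents contributed by the father, that each father cell contributes one X (his haplotype) and each daughter cell contributes one paternal and one maternal X, and that amplification is unbiased.

```python
def paternal_x_fraction(father_fraction):
    """Idealized fraction of X copies carrying the father's haplotype.

    father_fraction: share of cell-equivalents in the mixture that come from
    the father (0.0 = daughter DNA only, 1.0 = father DNA only).
    """
    if not 0.0 <= father_fraction <= 1.0:
        raise ValueError("father_fraction must be between 0 and 1")
    daughter_fraction = 1.0 - father_fraction
    paternal_copies = father_fraction + daughter_fraction  # one per cell either way
    maternal_copies = daughter_fraction  # only daughter cells carry the maternal X
    return paternal_copies / (paternal_copies + maternal_copies)


balanced = paternal_x_fraction(0.0)
spiked = paternal_x_fraction(0.2)
```

Under these assumptions the fraction simplifies to 1/(2 - f), so pure daughter DNA gives 0.5 and a 20% father spike gives about 0.556, a shift of the kind that mimics a duplication of the paternal segment.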

As shown in FIGS. 19A-19D, mosaicism from the father's DNA could be detected. These figures indicate that chromosome segments or entire chromosomes that are overrepresented can be detected.

All patents, patent applications, and published references cited herein are hereby incorporated by reference in their entirety. While the methods of the present disclosure have been described in connection with the specific embodiments thereof, it will be understood that they are capable of further modification. Furthermore, this application is intended to cover any variations, uses, or adaptations of the methods of the present disclosure, including such departures from the present disclosure as come within known or customary practice in the art to which the methods of the present disclosure pertain, and as fall within the scope of the appended claims. Any of the embodiments of the invention can be performed by analyzing the DNA and/or RNA in a sample. For example, any of the methods disclosed herein for DNA can be readily adapted for RNA, for example, by including a reverse transcription step to convert the RNA into DNA.

Example 4

This example describes an exemplary method for non-invasive cell-free tumor DNA-based detection of breast cancer-related copy number variations. Breast cancer screening involves mammography, which results in a high false positive rate and misses some cancers. Analysis of tumor-derived circulating cell-free DNA (ctDNA) for cancer-associated CNVs may allow for earlier, safer, and more accurate screening. A SNP-based massively multiplex PCR (mmPCR) approach was used to screen for CNVs in ctDNA isolated from the plasma of breast cancer patients. The mmPCR assay was designed to target 3,168 SNPs on chromosomes 1, 2, and 22, which often have CNVs in cancer (e.g., 49% of breast cancer samples have a 22q deletion). Six plasma samples from breast cancer patients (one stage IIa, four stage IIb, and one stage IIIb) were analyzed. Each sample had CNVs on one or more of the targeted chromosomes. The assay identified CNVs in all six plasma samples, including in one stage IIb sample that was correctly called at a ctDNA fraction of 0.58% (FIGS. 30, 31B, 32A, 32B, and 33); detection only required 86 heterozygous SNPs. A stage IIa sample was also correctly called at a ctDNA fraction of 4.33% using approximately 636 heterozygous SNPs (FIGS. 29, 31A, and 32A). This demonstrates that focal or whole chromosome arm CNVs, both common in cancer, can be readily detected.

To further evaluate sensitivity, 22 artificial mixtures containing a 3 Mb 22q CNV from a cancer cell line were mixed with DNA from a normal cell line (5:95) to simulate a ctDNA fraction of between 0.43% and 7.35% (FIGS. 28A-28C). The method correctly detected CNVs in 100% of these samples. Thus, artificial cfDNA polynucleotide standards/controls can be made by spiking isolated polynucleotide samples that include fragmented polynucleotide mixtures generated by non-cfDNA sources known to exhibit CNV, such as tumor cell lines, into other DNA samples at concentrations similar to those observed for cfDNA in vivo, such as between, for example, 0.01% and 20%, 0.1% and 15%, or 0.4% and 10% of DNA in that fluid. These standards/controls can be used as controls for assay design, characterization, development, and/or validation, and as quality control standards during testing, such as cancer testing performed in a CLIA lab and/or as standards included in research use only or diagnostic test kits. Significantly, in numerous cancers, including breast and ovarian cancers, CNVs are more prevalent than point mutations. Together, these results support that this SNP-based mmPCR approach offers a cost-effective, non-invasive method for detecting these cancers.

Example 5

This example describes an exemplary method for detection of copy number variations in breast cancer samples using SNP-targeted massively multiplexed PCR. Evaluation of CNV in tumor tissues typically involves SNP microarray or aCGH. These methods have high whole-genome resolution, but require large amounts of input material, have high fixed costs, and do not work well on formaldehyde fixed-paraffin embedded (FFPE) samples. For this example, 28,000-plex SNP-targeted PCR with next generation sequencing (NGS) was used to target 1p, 1q, 2p, 2q, 4p16, 5p15, 7q11, 15q, 17p, 22q11, 22q13 and chromosomes 13, 18, 21 and X for detection of CNVs in breast cancer samples. Accuracy was validated on 96 samples with aneuploidies or microdeletions. Single-molecule sensitivity was established by analyzing single cells. Of 17 breast cancer samples (15 fresh frozen and 2 FFPE tumor tissues, 5 pairs of matched tumor and normal cell lines) analyzed, 16 (including both FFPEs) were observed with full or partial CNVs in one to 15 targets (average: 7.8); evidence of tumor heterogeneity was observed. The three tissues with one CNV all had a 1q duplication, the most frequent cytogenetic abnormality in breast carcinoma. The most frequent regions with CNVs were 1q, 7p, and 22q1. Only one tumor tissue (with 9 CNVs) had a region with LOH; this LOH was also detected in adjacent putatively normal tissue that lacked the other 8 CNVs. By contrast, 5 or more regions with LOH and a high total CNV incidence (average: 12.8) were detected in cell lines. Thus, massively multiplexed PCR offers an economical high-throughput approach to investigate CNVs in a targeted manner, and is applicable to difficult-to-analyze samples, such as FFPE tissues.

Example 6

This example illustrates exemplary methods for calculating the limit of detection for any of the methods of the invention. These methods were used to calculate the limit of detection for single nucleotide variants (SNVs) in a tumor biopsy (FIG. 34) and a plasma sample (FIG. 35).

The first method (denoted “LOD-mr5” in FIGS. 34 and 35) calculates the limit of detection based on a minimum of 5 reads being chosen as the minimum number of times a SNV is observed in the sequencing data to have sufficient confidence the SNV is actually present. The limit of detection is based on whether the observed depth of read (DOR) is above this minimum of 5. The gray lines in FIGS. 34 and 35 indicate SNVs for which the limit of detection is limited by the DOR. In these cases, not enough reads were measured to reach the error limit of the assay. If desired, the limit of detection can be improved (resulting in a lower numerical value) for these SNVs by increasing the DOR.
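One way to read the read-count criterion is that the smallest detectable variant fraction is the minimum read count divided by the depth of read. The sketch below uses that reading; the formula and the function name are assumptions for illustration and are not stated in this form above.

```python
def lod_min_reads(depth_of_read, min_reads=5):
    """Read-count-limited limit of detection, as a fraction of reads.

    Assumes a variant is only trusted once it is seen min_reads times, so
    the smallest detectable fraction is min_reads / depth_of_read (capped
    at 1.0 when the depth is below the minimum).
    """
    if depth_of_read <= 0:
        raise ValueError("depth_of_read must be positive")
    return min(1.0, min_reads / depth_of_read)


for dor in (500, 5000, 50000):
    print(dor, f"{100 * lod_min_reads(dor):.3f}%")
```

Raising the depth of read lowers this value, which matches the remark that DOR-limited SNVs can be improved by sequencing more deeply.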

The second method (denoted “LOD-zs5.0” in FIGS. 34 and 35) calculates the limit of detection based on the z-score. The z-score is the number of standard deviations an observed error percentage is away from the background mean error. If desired, outliers can be removed and the z-score can be recalculated and this process can be repeated. The final weighted mean and the standard deviation of the error rate are used to calculate the z-score. The mean is weighted by the DOR since the accuracy is higher when the DOR is higher.

For the exemplary z-score calculation used for this example, the background mean error and standard deviation were calculated from all the other samples of the same sequencing run weighted by their depth of read, for each genomic locus and substitution type. Samples were not considered in the background distribution if they were 5 standard deviations away from the background mean. The orange lines in FIGS. 34 and 35 indicate SNVs for which the limit of detection is limited by the error rate. For these SNVs, enough reads were taken to reach the 5 read minimum, and the limit of detection was limited by the error rate. If desired, the limit of detection can be improved by optimizing the assay to reduce the error rate.
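The background calculation described above can be sketched as a depth-weighted mean and standard deviation with iterative outlier removal. This is a simplified illustration: the function names are hypothetical, a population-style weighted variance is assumed, and the conversion of the background into a limit of detection as mean plus 5 standard deviations is an assumed reading of the “zs5.0” label.

```python
import math


def weighted_background(error_rates, depths, z_cutoff=5.0, max_iter=10):
    """Depth-weighted mean and std of the error rate for one locus/substitution.

    Samples lying z_cutoff or more standard deviations from the mean are
    dropped and the statistics recomputed until nothing more is removed.
    """
    keep = list(range(len(error_rates)))
    mean = std = 0.0
    for _ in range(max_iter):
        total = sum(depths[i] for i in keep)
        mean = sum(depths[i] * error_rates[i] for i in keep) / total
        var = sum(depths[i] * (error_rates[i] - mean) ** 2 for i in keep) / total
        std = math.sqrt(var)
        if std == 0.0:
            break
        survivors = [i for i in keep if abs(error_rates[i] - mean) / std < z_cutoff]
        if len(survivors) == len(keep):
            break  # converged: no outliers left
        keep = survivors
    return mean, std


def lod_z_score(mean, std, z=5.0):
    """Error-rate-limited LOD, assumed here to be mean + z * std."""
    return mean + z * std


# Synthetic background: 50 well-behaved samples plus one contaminated sample.
rates = [0.0009, 0.0011] * 25 + [0.05]
depths = [1000] * 51
mean, std = weighted_background(rates, depths)
print(mean, std, lod_z_score(mean, std))
```

With equal depths the weighting has no effect; it matters when some background samples are sequenced far more deeply than others.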

The third method (denoted “LOD-zs5.0-mr5” in FIGS. 34 and 35) calculates the limit of detection based on the maximum value of the above two metrics.
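Taking the maximum of the two metrics also identifies which factor limits each SNV. The short sketch below illustrates this; the function names and the labels are hypothetical, and both inputs are assumed to be expressed as fractions of reads.

```python
def combined_lod(lod_reads, lod_error):
    """Combined limit of detection: the larger (worse) of the two metrics."""
    return max(lod_reads, lod_error)


def limiting_factor(lod_reads, lod_error):
    """Report which metric sets the combined limit for this SNV."""
    return "DOR" if lod_reads > lod_error else "error rate"


print(combined_lod(0.0005, 0.0015), limiting_factor(0.0005, 0.0015))
print(combined_lod(0.0100, 0.0015), limiting_factor(0.0100, 0.0015))
```

An SNV labeled "DOR" here would correspond to a gray line in the figures, and one labeled "error rate" to an orange line.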

For the analysis of a tumor sample shown in FIG. 34, the mean limit of detection was 0.36%, and the median limit of detection was 0.28%. The number of DOR limited (gray lines) SNVs was 934. The number of error rate limited (orange lines) SNVs was 738.

For the analysis of cfDNA in a plasma sample shown in FIG. 35, the mean limit of detection was 0.24%, and the median limit of detection was 0.09%. The number of DOR limited (gray lines) SNVs was 732. The number of error rate limited (orange lines) SNVs was 921.

Example 7

This example illustrates the detection of CNVs and SNVs from the same single cell. The following primer libraries were used: a library of ~28,000 primers for detecting CNVs, a library of ~3,000 primers for detecting CNVs, and a library of primers for detecting SNVs. For analysis of a single cell, cells were serially diluted until there were 3 or 4 cells per droplet. An individual cell was pipetted and placed into a PCR tube. The cell was lysed using Protease K, salt, and DTT using the following conditions: 56° C. for 20 minutes, 95° C. for 10 minutes, and then a 4° C. hold. For analysis of genomic DNA, DNA from the same cell line as the analyzed single cell was either purchased or obtained by growing the cells and extracting the DNA.

For amplification with the library of ~28,000 primers, the following PCR conditions were used: a 40 uL reaction volume, 7.5 nM of each primer, and 2× master mix (MM). In some embodiments QIAGEN Multiplex PCR Kit is used for the master mix (QIAGEN catalog No. 206143; see, e.g., information available at the world wide web at qiagen.com/products/catalog/assay-technologies/end-point-per-and-rt-per-reagents/qiagen-multiplex-per-kit, which is hereby incorporated by reference in its entirety). The kit includes 2× QIAGEN Multiplex PCR Master Mix (providing a final concentration of 3 mM MgCl₂, 3×0.85 ml), 5× Q-Solution (1×2.0 ml), and RNase-Free Water (2×1.7 ml). The QIAGEN Multiplex PCR Master Mix (MM) contains a combination of KCl and (NH₄)₂SO₄ as well as the PCR additive, Factor MP, which increases the local concentration of primers at the template. Factor MP stabilizes specifically bound primers, allowing efficient primer extension by, e.g., HotStarTaq DNA Polymerase. HotStarTaq DNA Polymerase is a modified form of Taq DNA polymerase and has no polymerase activity at ambient temperatures. The following thermocycling conditions were used for the first round of PCR: 95° C. for 10 minutes; 25 cycles of 96° C. for 30 seconds, 65° C. for 29 minutes, and 72° C. for 30 seconds; and then 72° C. for 2 minutes, and a 4° C. hold. For the second round of PCR, a 10 ul reaction volume, 1× MM, and 5 nM of each primer were used. The following thermocycling conditions were used: 95° C. for 15 minutes; 25 cycles of 94° C. for 30 seconds, 65° C. for 1 minute, 60° C. for 5 minutes, 65° C. for 5 minutes, and 72° C. for 30 seconds; and then 72° C. for 2 minutes, and a 4° C. hold.

For the library of ~3,000 primers, exemplary reaction conditions include a 10 ul reaction volume, 2× MM, 70 mM TMAC, and 2 nM of each primer. For the library of primers for detecting SNVs, exemplary reaction conditions include a 10 ul reaction volume, 2× MM, 4 mM EDTA, and 7.5 nM of each primer. Exemplary thermocycling conditions include 95° C. for 15 minutes; 20 cycles of 94° C. for 30 seconds, 65° C. for 15 minutes, and 72° C. for 30 seconds; and then 72° C. for 2 minutes, and a 4° C. hold.

The amplified products were barcoded. One run of sequencing was performed with an approximately equal number of reads per sample.

FIGS. 36A and 36B show results from analysis of genomic DNA (FIG. 36A) or DNA from a single cell (FIG. 36B) using a library of approximately 28,000 primers designed to detect CNVs. Approximately 4 million reads were measured per sample. The presence of two central bands instead of one central band indicates the presence of a CNV. For three samples of DNA from a single cell, the percent of mapped reads was 89.9%, 94.0%, and 93.4%, respectively. For two samples of genomic DNA the percent of mapped reads was 99.1% for each sample.

FIGS. 37A and 37B show results from analysis of genomic DNA (FIG. 37A) or DNA from a single cell (FIG. 37B) using a library of approximately 3,000 primers designed to detect CNVs. Approximately 1.2 million reads were measured per sample. The presence of two central bands instead of one central band indicates the presence of a CNV. For three samples of DNA from a single cell, the percent of mapped reads was 98.2%, 98.2%, and 97.9%, respectively. For two samples of genomic DNA the percent of mapped reads was 98.8% for each sample. FIG. 38 illustrates the uniformity in DOR for these ~3,000 loci.

For calling SNVs, the call percent for true positive mutations was similar for DNA from a single cell and genomic DNA. A graph of call percent for true positive mutations for single cells on the y-axis versus that for genomic DNA on the x-axis yielded a curve fit of y=1.0076x−0.3088 with R²=0.9834. FIG. 39 shows similar error call metrics for genomic DNA and DNA from a single cell. FIG. 40 shows that the error rate for detecting transition mutations was greater than for detecting transversion mutations, indicating it may be desirable to select transversion mutations for detection rather than transition mutations when possible.

Example 8

This example further validates a massively multiplexed PCR methodology for chromosomal aneuploidy and CNV determination disclosed herein, called CoNVERGe (Copy Number Variant Events Revealed Genotypically), and further illustrates the development and use of “PlasmArt” standards for PCR of ctDNA samples. PlasmArt standards include polynucleotides having sequence identity to regions of the genome known to exhibit CNV and a size distribution that reflects that of cfDNA fragments naturally found in plasma.

Sample Collection

Human breast cancer cell lines (HCC38, HCC1143, HCC1395, HCC1937, HCC1954, and HCC2218) and matched normal cell lines (HCC38BL, HCC1143BL, HCC1395BL, HCC1937BL, HCC1954BL, and HCC2218BL) were obtained from the American Type Culture Collection (ATCC). Trisomy 21 B-lymphocyte (AG16777) and paired father/child DiGeorge Syndrome (DGS) cell lines (GM10383 and GM10382, respectively) were from the Coriell Cell Repository (Camden, N.J.). GM10382 cells only have the paternal 22q11.2 region.

We procured tumour tissues from 16 breast cancer patients, including 11 fresh frozen (FF) samples from Geneticist (Glendale, Calif.) and five formalin-fixed paraffin-embedded (FFPE) samples from North Shore-LIJ (Manhasset, N.Y.). We acquired matched buffy coat samples for eight patients and matched plasma samples for nine patients. FF tumour tissues and matched buffy coat and plasma samples from five ovarian cancer patients were from North Shore-LIJ. For eight breast tumour FF samples, tissue subsections were resected for analysis. Institutional review board approvals from Northshore/LIJ IRB and Kharkiv National Medical University Ethics Committee were obtained for sample collection and informed consent was obtained from all subjects.

Blood samples were collected into EDTA tubes. Circulating tumour DNA was isolated from 1 mL plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen, Valencia, Calif.).

To make the PlasmArt standards according to one exemplary method, first, 9 × 10⁶ cells were lysed with hypotonic lysis buffer (20 mM Tris-Cl (pH 7.5), 10 mM NaCl, and 3 mM MgCl₂) for 15 min on ice. Then, 10% IGEPAL CA-630 (Sigma, St. Louis, Mo.) was added to a final concentration of 0.5%. After centrifugation at 3,000 g for 10 min at 4° C., pelleted nuclei were resuspended in 1× micrococcal nuclease (MNase) Buffer (New England BioLabs, Ipswich, Mass.) before adding 1000 U of MNase (New England BioLabs), and then incubated for 5 min at 37° C. Reactions were stopped by adding EDTA to a final concentration of 15 mM. Undigested chromatin was removed by centrifugation at 2,000 g for 1 min. Fragmented DNA was purified with the DNA Clean & Concentrator™-500 kit (Zymo Research, Irvine, Calif.). Mononucleosomal DNA produced by MNase digestion was also purified and size-selected using AMPure XP magnetic beads (Beckman Coulter, Brea, Calif.). DNA fragments were sized and quantified with a Bioanalyzer DNA 1000 chip (Agilent, Santa Clara, Calif.).

To model ctDNA at different concentrations, different fractions of PlasmArts from HCC1954 and HCC2218 cancer cells were mixed with those from the corresponding matched normal cell line (HCC1954BL and HCC2218BL, respectively). Three samples at each concentration were analyzed. Similarly, to model allelic imbalances in plasma DNA in a focal 3.5 Mb region, we generated PlasmArts from DNA mixtures containing different ratios of DNA from a child with a maternal 22q11.2 deletion and DNA from the father. Samples containing only the father's DNA were used as negative controls. Eight samples at each concentration were analyzed.

Accordingly, to evaluate the sensitivity and reproducibility of CoNVERGe, especially when the proportion of abnormal DNA for a CNV, or average allelic imbalance (AAI), is low, we used it to detect CNVs in DNA mixtures comprised of a previously characterized abnormal sample titrated into a matched normal sample. The mixtures consisted of artificial cfDNA, termed “PlasmArt”, with fragment size distribution approximating natural cfDNA (see above). FIG. 42 graphically displays the size distribution of an exemplary PlasmArt prepared from a cancer cell line compared to the size distribution of cfDNA, looking at CNVs on chromosome arms 1p, 1q, 2p, and 2q. In the first pair, a son's tumor DNA sample having a 3 Mb focal CNV deletion of the 22q11.2 region was titrated into a matched normal sample from the father at between 0-1.5% total cfDNA (FIG. 41a). CoNVERGe reproducibly identified CNVs corresponding to the known abnormality with estimated AAI of >0.35% in mixtures of ≥0.5%+/−0.2% AAI, failed to detect the CNV in 6/8 replicates at 0.25% abnormal DNA, and reported a value of ≤0.05% for all eight negative control samples. The AAI values estimated by CoNVERGe showed high linearity (R²=0.940) and reproducibility (error variance=0.087). The assay was sensitive to different levels of amplification within the same sample. Based on these data a conservative detection threshold of 0.45% AAI could be used for subsequent analyses. Using this cutoff, another experiment was performed in which PlasmArt synthetic ctDNA was spiked at known concentrations to create synthetic cancer plasma at between around 0.5% and around 3.5%. Negative plasma was also included as a control. All of the synthetic cancer plasma yielded estimates above 0.45% and the reading for the negative plasma was well below 0.45% (FIGS. 43A-43D). FIG. 43A shows the maximum likelihood of tumor, and FIG. 43B shows an estimate of DNA fraction results as an odds ratio plot. FIG. 43C is a plot for the detection of transversion events. FIG. 43D is a plot for the detection of transition events.

Two additional PlasmArt titrations, prepared from pairs of matched tumor and normal cell line samples and having CNVs on chromosome 1 or chromosome 2, were also evaluated (FIG. 41b, 41c). Among negative controls, all values were <0.45%, and high linearity (R²=0.952 for HCC1954 1p, R²=0.993 for HCC1954 1q, R²=0.977 for HCC2218 2p, R²=0.967 for HCC2218 2q) and reproducibility (error variance=0.190 for HCC1954 1p, 0.029 for HCC1954 1q, 0.250 for HCC2218 2p, and 0.350 for HCC2218 2q) were observed between the known input DNA amount and that calculated by CoNVERGe. The difference in the slopes of the regressions for regions 1p and 1q of one sample pair correlates with the relative difference in copy number observed in the B-allelic frequencies (BAFs) of regions 1p and 1q of the same sample, demonstrating the relative precision of the AAI estimate calculated by CoNVERGe (FIG. 41c, 41d).

The workflow for processing samples is illustrated in FIG. 63. CoNVERGe has application to a variety of sample sources including FFPE, Fresh Frozen, Single Cell, Germline control and cfDNA. We applied CoNVERGe to six human breast cancer cell lines and matched normal cell lines to assess whether it can detect somatic CNVs. Arm-level and focal CNVs were present in all six tumour cell lines, but were absent from their matched normal cell lines, with the exception of chromosome 2 in HCC1143 in which the normal cell line exhibits a deviation from the 1:1 homolog ratio (FIG. 63b). To validate these results on a different platform, we performed CytoSNP-12 microarray analyses, which produced consistent results for all samples (FIG. 63d, 63e). Moreover, the maximum homolog ratios for CNVs identified by CoNVERGe and CytoSNP-12 microarrays exhibited a strong linear correlation (R²=0.987, P<0.001) (FIG. 63f).

We next applied CoNVERGe to fresh-frozen (FF) (FIG. 64a) and formalin-fixed, paraffin-embedded (FFPE) breast tumour tissue samples (FIG. 64b, 64d). In both sample types, several arm-level and focal CNVs were present; however, no CNVs were detected in DNA from matched buffy coat samples. CoNVERGe results were highly correlated with those from microarray analyses of the same samples (FIG. 64e-h; R²=0.909, P<0.001 for CytoSNP-12 on FF; R²=0.992, P<0.001 for OncoScan on FFPE). CoNVERGe also produces consistent results on small quantities of DNA extracted from laser capture microdissection (LCM) samples, for which microarray methods are not suitable.

Detection of CNVs in Single Cells with CoNVERGe

To test the limits of the applicability of this mmPCR approach, we isolated single cells from the six aforementioned cancer cell lines and from a B-lymphocyte cell line that had no CNVs in the target regions. The CNV profiles from these single-cell experiments were consistent between three replicates and with those from genomic DNA (gDNA) extracted from a bulk sample of about 20,000 cells (FIG. 65). On the basis of the number of SNPs with no sequencing reads, the average assay drop-out rate for bulk samples was 0.48% (range: 0.41-0.60%), which is attributable to either synthesis or assay design failure. For single cells, the additional average assay drop-out rate observed was 0.39% (range: 0.19-0.67%). For single cell assays that did not fail (i.e. no assay drop-out occurred), the average single ADO rate calculated using heterozygous SNPs only was 0.05% (range: 0.00-0.43%). Additionally, the percentage of SNPs with high confidence genotypes (i.e. SNP genotypes determined with at least 98% confidence) was similar for both single cell and bulk samples and the genotypes in the single cell samples matched those in the bulk sample (average 99.52%, range: 92.63-100.00%).

In single cells, allele frequencies are expected to directly reflect chromosome copy numbers, unlike in tumour samples where this may be confounded by TH and non-tumour cell contamination. BAFs of 1/n and (n−1)/n indicate n chromosome copies in a region. Chromosome copy numbers are indicated on the allele frequency plots for both single cells and matched gDNA samples (FIG. 65).
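The relationship between BAF bands and copy number can be illustrated with a short sketch. It covers only the 1/n and (n−1)/n bands named above and ignores noise, other allele configurations (such as 2:2 at four copies), and mixed cell populations; the function names and the `max_copies` bound are hypothetical.

```python
from fractions import Fraction


def expected_bafs(n):
    """Heterozygous BAF bands when one allele is present once among n copies."""
    if n < 2:
        raise ValueError("need at least two copies for a heterozygous band")
    return (Fraction(1, n), Fraction(n - 1, n))


def copies_from_baf(baf, max_copies=8):
    """Copy number whose 1/n or (n-1)/n band lies closest to an observed BAF."""
    def distance(n):
        low, high = expected_bafs(n)
        return min(abs(baf - float(low)), abs(baf - float(high)))
    return min(range(2, max_copies + 1), key=distance)


print(expected_bafs(3))        # bands at 1/3 and 2/3
print(copies_from_baf(0.66))   # closest to the 2/3 band
```

For two copies the two bands coincide at 1/2, which is why a single central band indicates a normal region and two central bands indicate a CNV.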

Application of CoNVERGe to Plasma Samples

To investigate the ability of CoNVERGe to detect CNVs in real plasma samples, we applied our approach to cfDNA paired with a matched tumour biopsy from each of two stage II breast cancer patients and five late-stage ovarian cancer patients. In all seven patients, CNVs were detected in both FF tumour tissues and in the corresponding plasma samples (FIG. 66). FIG. 67 provides a list of SNV breast cancer mutations. A total of 32 CNVs, at a level of ≥0.45% AAI, were detected in the seven plasma samples (range: 0.48-12.99% AAI) over the five regions assayed, which represent about 20% of the genome. Note that the presence of CNVs in plasma cannot be confirmed due to the lack of alternative orthogonal methods.

Although AAI estimates may appear correlated with BAFs in tumour, direct proportionality should not necessarily be expected due to tumour heterogeneity. For example, in sample BC5 (FIG. 66a), the ovals at the upper left area of FIG. 66a indicate regions that have BAFs compatible with N=11; combining this with the AAI calculation from the plasma sample leads to estimates for c of 2.33% and 2.67% for the two regions. Estimating c using the other regions in the sample gives values between 4.46% and 9.53%, which clearly demonstrates the presence of tumor heterogeneity.

These data demonstrate that CNVs can be detected in plasma in a substantial fraction of samples, and suggest that the more prevalent a CNV is within a tumour, the more likely it is to be observed in cfDNA. Furthermore, CoNVERGe detected CNVs from a liquid biopsy that may have otherwise gone unobserved in a traditional tumour biopsy.

Example 9

This example provides details regarding certain exemplary sample preparation methods used for CoNVERGe analysis of different types of samples.

Single Cell CNV Protocol for 28,000-plex PCR

Multiplexed PCR allows simultaneous amplification of many targets in a single reaction. Target SNPs were identified in each genomic region with 10% minimum population minor allele frequency (1000 Genomes Project data; Apr. 30, 2012 release). For each SNP, multiple semi-nested primers were designed to have a maximum amplicon length of 75 bp and a melting temperature between 54 and 60.5° C. Primer interaction scores for all possible combinations of primers were calculated; primers with high scores were eliminated to reduce the likelihood of primer dimer product formation. Candidate PCR assays were ranked and selected on the basis of target SNP minor allele frequency, observed heterozygosity rate (from dbSNP), presence in HapMap, and amplicon length.
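The filtering and ranking steps described above might be sketched as follows. The thresholds for amplicon length, melting temperature, and minor allele frequency come from the text; the `Assay` record, the single per-assay interaction score, its cutoff of 0.5, and the sort order are hypothetical simplifications, since the actual scoring and ranking rules are not specified here.

```python
from dataclasses import dataclass


@dataclass
class Assay:
    snp_id: str
    amplicon_length: int       # bp
    melting_temp: float        # degrees C
    minor_allele_freq: float   # population MAF, 0-0.5
    interaction_score: float   # hypothetical worst-case primer-dimer score


def select_assays(candidates, max_len=75, tm_range=(54.0, 60.5),
                  min_maf=0.10, max_interaction=0.5):
    """Filter candidate assays by design limits, then rank the survivors."""
    passing = [
        a for a in candidates
        if a.amplicon_length <= max_len
        and tm_range[0] <= a.melting_temp <= tm_range[1]
        and a.minor_allele_freq >= min_maf
        and a.interaction_score <= max_interaction  # assumed cutoff
    ]
    # Assumed ranking: more informative SNPs first, shorter amplicons next.
    return sorted(passing, key=lambda a: (-a.minor_allele_freq, a.amplicon_length))


candidates = [
    Assay("snpA", 70, 57.0, 0.45, 0.1),
    Assay("snpB", 80, 57.0, 0.48, 0.1),   # amplicon too long
    Assay("snpC", 65, 61.0, 0.30, 0.1),   # melting temperature too high
    Assay("snpD", 60, 55.5, 0.45, 0.2),
    Assay("snpE", 72, 58.0, 0.20, 0.9),   # interaction score too high
]
print([a.snp_id for a in select_assays(candidates)])
```

A fuller version would score every primer pair combination and remove primers iteratively, rather than relying on one precomputed score per assay.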

In certain experiments, single cell samples were prepared and amplified using a mmPCR 28,000-plex protocol. The samples were prepared in the following way: For analysis of a single cell, cells were serially diluted until there were 3 or 4 cells per droplet. An individual cell was pipetted and placed into a PCR tube. The cell was lysed using Protease K, salt, and DTT using the following conditions: 56° C. for 20 minutes, 95° C. for 10 minutes, and then a 4° C. hold. For analysis of genomic DNA, DNA from the same cell line as the analyzed single cell was either purchased or obtained by growing the cells and extracting the DNA. The DNA was amplified in a 40 uL reaction volume containing Qiagen mp-PCR master mix (2×MM final conc), 7.5 nM primer conc. for 28K primer pairs having hemi-nested Rev primers under the following conditions: 95 C 10 min, 25× [96 C 30 sec, 65 C 29 min, 72 C 30 sec], 72 C 2 min, 4 C hold. The amplification product was diluted 1:200 in water and 2 ul added to STAR 2 (10 ul reaction volume) 1×MM, 5 nM primer conc. and PCR was performed using hemi-nested inner Fwd primer and tag specific Rev primer: 95 C 15 min, 25× [94 C 30 sec, 65 C 1 min, 60 C 5 min, 65 C 5 min, 72 C 30 sec], 72 C 2 min, 4 C hold.

Full sequence tags and barcodes were attached to the amplification products and amplified for 9 cycles using adaptor specific primers. Prior to sequencing, the barcoded library products were pooled, purified with the QIAquick PCR Purification Kit (Qiagen), and quantified using the Qubit® dsDNA BR Assay Kit (Life Technologies). Amplicons were sequenced using an Illumina HiSeq 2500 sequencer.

Extraction of DNA from a Blood/Plasma Sample

Blood samples were collected into EDTA tubes. The whole blood sample was centrifuged and separated into three layers: the upper layer, 55% of the blood sample, was plasma and contained cell-free DNA (cfDNA); the buffy coat middle layer contained leucocytes having DNA, <1% of total; and the bottom layer, 45% of the collected blood sample, contained erythrocytes; no DNA was present in this fraction as erythrocytes are enucleated. Circulating tumor DNA was isolated from at least 1 mL plasma using the QIAamp Circulating Nucleic Acid Kit, Qia-Amp (Qiagen, Valencia, Calif.), according to the manufacturer's protocol.

Plasma CNV Protocol for 3,168-Plex for Chromosomes 1p, 1q, 2p, 2q, and 22q11

Plasma DNA libraries were prepared and amplified using a mmPCR 3,168-plex protocol. The samples were prepared in the following way: Up to 20 mL of blood was centrifuged to isolate the buffy coat and the plasma. Plasma extraction of cfDNA and library preparation were performed. DNA was eluted in 50 uL TE buffer. The input for mmPCR was 6.7 uL of amplified and purified Natera plasma library at an input amount of approximately 1200 ng. The plasma DNA was amplified in a 20 uL reaction volume containing Qiagen mp-PCR master mix (2×MM final conc), 2 nM tagged primer conc. (total 12.7 uM) in 3,168-plex primer pools and PCR amplified: 95 C 10 min, 25× [96 C 30 sec, 65 C 20 min, 72 C 30 sec], 72 C 2 min, 4 C hold. The amplification product was diluted 1:2,000 in water and 1 ul added to the Barcoding-PCR in a 10 uL reaction volume. The barcodes are attached to the amplification products via PCR amplification for 12 cycles using tag specific primers. Products of multiple samples are pooled and then purified with QIAquick PCR Purification Kit (Qiagen) and eluted in 50 ul DNA suspension buffer. Samples are sequenced by NGS as described for the Single Cell CNV Protocol for 28,000-plex PCR.

Breast Cancer Feasibility SNV Panel from Plasma

cfDNA from breast cancer patient blood samples was prepared and amplified using 336 primer pairs that were distributed into four 84-plex pools. Natera plasma libraries were prepared as described for Plasma CNV Protocol for 3,168-plex for Chromosomes 1p, 1q, 2p, 2q, and 22q11. DNA was eluted in 50 uL TE buffer. The input for mPCR was 2.5 uL of amplified and purified Natera plasma library at an input amount of approximately 600 ng. FIGS. 68A-B represent the major and minor allele frequencies of the SNPs used in a 3168 mmPCR reaction. The X-axis represents the number of SNPs, from left to right, for chromosome 1q, 1p, 2q, 2p and 22q. SNPs were selected from the 1000 Genomes map for Humans, Group 19 and dbSNP to pick targets, but only SNPs from the 1000 Genomes were used to screen for minor allele frequencies. The plasma DNA was amplified in four parallel reactions of 84-plex primer pools, a 10 uL reaction volume containing Qiagen mp-PCR master mix (2×MM final conc.), 4 mM EDTA, 7.5 nM primer concentration (total 1.26 uM) and PCR amplified: 95 C 15 min, 25× [94 C 30 sec, 65 C 15 min, 72 C 30 sec], 72 C 2 min, 4 C hold. The amplification products of the 4 subpools were each diluted 1:200 in water and 1 ul added to the Barcoding-PCR reaction in a 10 uL reaction volume containing Q5 HS HF master mix (1× final), and 1 uM each barcoding primer and each of the pools were amplified in the following reaction: 98 C 1 min, 25× [98 C 10 sec, 70 C 10 sec, 60 C 30 sec, 65 C 15 sec, 72 C 15 sec], 72 C 2 min, 4 C hold. Libraries were purified with QIAquick PCR Purification Kit (Qiagen) and eluted in 50 ul DNA suspension buffer. Samples were sequenced by paired end sequencing.
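The total primer concentrations quoted for the 3,168-plex and 84-plex reactions can be checked with simple arithmetic. The sketch below assumes two primers (forward and reverse) per target, which is not stated explicitly above but reproduces both quoted totals; the function name is hypothetical.

```python
def total_primer_conc_um(per_primer_nm, plex, primers_per_target=2):
    """Total primer concentration in uM for a multiplex pool.

    per_primer_nm: concentration of each individual primer, in nM
    plex: number of targets amplified in the pool
    primers_per_target: assumed to be 2 (one forward, one reverse primer)
    """
    return per_primer_nm * plex * primers_per_target / 1000.0


print(round(total_primer_conc_um(2.0, 3168), 1))   # 3,168-plex at 2 nM each
print(round(total_primer_conc_um(7.5, 84), 2))     # 84-plex at 7.5 nM each
```

The same calculation can be used to keep the total primer load comparable when the plex level or per-primer concentration is changed.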

Example 10

This example provides details regarding certain exemplary methods for analyzing sequencing data to identify SNVs.

SNV METHOD 1: For this embodiment, a background error model was constructed using normal plasma samples, which were sequenced on the same sequencing run to account for run-specific artifacts. In certain embodiments, 5, 10, 15, 20, 25, 30, 40, 50, 100, 150, 200, 250, or more than 250 normal plasma samples were analyzed on the same sequencing run. In certain illustrative embodiments, 20, 25, 40, or 50 normal plasma samples are analyzed on the same sequencing run. Noisy positions with normal median variant allele frequency greater than a cutoff are removed. For example, this cutoff in certain embodiments is >0.1%, 0.2%, 0.25%, 0.5%, 1%, 2%, 5%, or 10%. In certain illustrative embodiments, noisy positions with normal median variant allele frequency greater than 0.5% are removed. Outlier samples were iteratively removed from the model to account for noise and contamination. In certain embodiments, samples with a Z-score of greater than 5, 6, 7, 8, 9, or 10 were removed from the data analysis. For each base substitution of every genomic locus, the depth of read weighted mean and standard deviation of the error were calculated. Tumor or cell-free plasma samples' positions with at least 5 variant reads and a Z-score of 10 against the background error model were called as a candidate mutation.
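The final calling rule of SNV Method 1 can be sketched as a simple test against the background model. This illustration reads “a Z-score of 10” as “a Z-score of at least 10”, which is an assumption; the function name and argument layout are hypothetical, and the background mean and standard deviation are taken as already computed for the relevant locus and substitution.

```python
def call_candidate(variant_reads, depth, bg_mean, bg_std,
                   min_reads=5, z_cutoff=10.0):
    """Return True if a position passes the read-count and Z-score thresholds.

    bg_mean, bg_std: depth-weighted background error mean and standard
    deviation for this locus and base substitution (fractions of reads).
    """
    if depth <= 0 or bg_std <= 0:
        return False  # no usable data or no usable background model
    vaf = variant_reads / depth
    z = (vaf - bg_mean) / bg_std
    return variant_reads >= min_reads and z >= z_cutoff


# Background error of 0.1% +/- 0.01% at this position (illustrative values).
print(call_candidate(50, 10000, 0.001, 0.0001))   # strong signal
print(call_candidate(15, 10000, 0.001, 0.0001))   # within background noise
```

Requiring both conditions guards against calls driven by a handful of reads at shallow positions as well as calls at deep but noisy positions.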

SNV METHOD 2: For this embodiment we aim to determine Single Nucleotide Variants (SNVs) using plasma ctDNA data. We model the PCR process as a stochastic process, estimate the parameters using a training set and make the final SNV calls using a separate testing set. The main idea is to determine the propagation of the error across multiple PCR cycles, calculate the mean and the variance of the background error, and differentiate the background error from real mutations.

The following parameters are estimated for each base:

-   $p$ = efficiency (probability that each read is replicated in each cycle)
-   $p_{e}$ = error rate per cycle for mutation type $e$ (probability that an error of type $e$ occurs)
-   $X_{0}$ = initial number of molecules

The more times a read is replicated over the course of the PCR process, the more errors occur. Hence, the error profile of the reads is determined by the degrees of separation from the original read. We refer to a read as $k$-th generation if it has gone through $k$ replications until it has been generated.

Let us define the following variables for each base:

-   $X_{ij}$ = number of generation $i$ reads generated in PCR cycle $j$
-   $Y_{ij}$ = total number of generation $i$ reads at the end of cycle $j$
-   $X_{ij}^{e}$ = number of generation $i$ reads with mutation $e$ generated in PCR cycle $j$

Moreover, in addition to the normal molecules $X_{0}$, there may be an additional $f_{e}X_{0}$ molecules with the mutation $e$ at the beginning of the PCR process (hence $f_{e}/(1+f_{e})$ is the fraction of mutated molecules in the initial mixture).

Given the total number of generation $i-1$ reads at cycle $j-1$, the number of generation $i$ reads generated at cycle $j$ has a binomial distribution with a sample size of $Y_{i-1,j-1}$ and probability parameter $p$. Hence, $E(X_{ij} \mid Y_{i-1,j-1}, p) = pY_{i-1,j-1}$ and $\mathrm{Var}(X_{ij} \mid Y_{i-1,j-1}, p) = p(1-p)Y_{i-1,j-1}$.

We also have $Y_{ij} = \sum_{k=i}^{j} X_{ik}$. Hence, by recursion, simulation or similar methods, we can determine $E(X_{ij})$. Similarly, we can determine $\mathrm{Var}(X_{ij}) = E(\mathrm{Var}(X_{ij} \mid p)) + \mathrm{Var}(E(X_{ij} \mid p))$ using the distribution of $p$.

Finally, $E(X_{ij}^{e} \mid Y_{i-1,j-1}, p_{e}) = p_{e}Y_{i-1,j-1}$ and $\mathrm{Var}(X_{ij}^{e} \mid Y_{i-1,j-1}, p_{e}) = p_{e}(1-p_{e})Y_{i-1,j-1}$, and we can use these to compute $E(X_{ij}^{e})$ and $\mathrm{Var}(X_{ij}^{e})$.
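The recursion for the expected generation counts can be sketched as below. This is an illustration under a simplifying assumption that is not in the text: the efficiency p is held fixed, whereas the method treats p as a random quantity with an estimated distribution. The function name is hypothetical. With generation 0 being the X₀ original molecules, the relation Y_ij = Y_i,j-1 + X_ij together with E(X_ij) = p·E(Y_i-1,j-1) gives the update used in the loop.

```python
def expected_generation_counts(x0, p, n_cycles):
    """Return [E(Y_0n), E(Y_1n), ..., E(Y_nn)] after n_cycles for a fixed efficiency p."""
    # y[i] holds E(Y_ij) for the current cycle j; generation 0 is the original molecules.
    y = [float(x0)] + [0.0] * n_cycles
    for j in range(1, n_cycles + 1):
        prev = y[:]  # expected counts at the end of cycle j-1
        for i in range(1, j + 1):
            # E(X_ij) = p * E(Y_{i-1,j-1}); Y_ij accumulates X_ik over cycles k <= j
            y[i] = prev[i] + p * prev[i - 1]
    return y

counts = expected_generation_counts(x0=1000, p=0.9, n_cycles=5)
print(counts)
print(sum(counts))  # total expected molecules
```

For fixed p the total over all generations equals (1+p)^n·X₀, which matches the read-count approximation used in the algorithm for estimating the efficiency.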

Algorithm

The algorithm starts by estimating the efficiency and error rate per cycle using the training set. Let n denote the total number of PCR cycles.

The number of reads $R_{b}$ at each base $b$ can be approximated by $(1+p_{b})^{n}X_{0}$, where $p_{b}$ is the efficiency at base $b$. Then $(R_{b}/X_{0})^{1/n}$ can be used to approximate $1+p_{b}$. Then, we can determine the mean and the standard deviation of $p_{b}$ across all training samples, to estimate the parameters of the probability distribution (such as normal, beta, or similar distributions) for each base.

Similarly, the number of error $e$ reads $R_{b}^{e}$ at each base $b$ can be used to estimate $p_{e}$. After determining the mean and the standard deviation of the error rate across all training samples, we approximate its probability distribution (such as normal, beta, or similar distributions), whose parameters are estimated using these mean and standard deviation values.
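The efficiency estimate from the read count can be sketched as follows. The function name and the training read counts are hypothetical, and a single known starting copy number is assumed for all training samples, which the text does not specify.

```python
import statistics

def estimate_efficiency(reads, x0, n_cycles):
    """Invert R_b ~ (1 + p_b)^n * X0 to get p_b = (R_b / X0)^(1/n) - 1."""
    return (reads / x0) ** (1.0 / n_cycles) - 1.0

# Hypothetical training samples: read counts at one base after 25 cycles from 1000 copies.
n_cycles, x0 = 25, 1000
training_reads = [1000 * (1 + p) ** n_cycles for p in (0.88, 0.90, 0.92)]

p_values = [estimate_efficiency(r, x0, n_cycles) for r in training_reads]
p_mean = statistics.mean(p_values)
p_sd = statistics.stdev(p_values)
print(p_mean, p_sd)  # parameters for a fitted distribution of p_b at this base
```

The mean and standard deviation obtained this way are what would parameterize the normal, beta, or similar distribution of the efficiency at that base.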

Next, for the testing data, we estimate the initial starting copy number at each base as

$\int_{0}^{1} \frac{R_{b}}{(1 + p_{b})^{n}} f(p_{b})\, dp_{b}$

where f(.) is an estimated distribution from the training set.
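The integral can be evaluated numerically. The sketch below uses a midpoint rule and assumes, purely for illustration, that f is a normal density truncated and renormalized to the interval [0, 1]; the function names and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def estimate_starting_copies(reads, n_cycles, mu, sigma, steps=10000):
    """Midpoint-rule estimate of the integral of R_b / (1 + p)^n * f(p) over [0, 1]."""
    h = 1.0 / steps
    weighted = 0.0
    norm = 0.0
    for k in range(steps):
        p = (k + 0.5) * h
        w = normal_pdf(p, mu, sigma)
        weighted += reads / (1.0 + p) ** n_cycles * w
        norm += w
    # Dividing by norm renormalizes f on [0, 1] (the step width h cancels).
    return weighted / norm

reads = 1000 * 1.9 ** 25  # hypothetical read count from 1000 copies at p = 0.9
print(estimate_starting_copies(reads, n_cycles=25, mu=0.9, sigma=0.001))
```

With a narrow distribution around the true efficiency the estimate returns roughly the original copy number; a wider distribution shifts the estimate because the integrand is strongly nonlinear in p.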

Hence, we have estimated the parameters that will be used in the stochastic process. Then, by using these estimates, we can estimate the mean and the variance of the molecules created at each cycle (note that we do this separately for normal molecules, error molecules, and mutation molecules).

Finally, by using a probabilistic method (such as maximum likelihood or similar methods), we can determine the $f_{e}$ value that best fits the distribution of the error, mutation, and normal molecules. More specifically, we estimate the expected ratio of the error molecules to total molecules in the final reads for various $f_{e}$ values, determine the likelihood of our data for each of these values, and then select the value with the highest likelihood.
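The grid search over candidate values can be sketched as below. Two simplifications are assumptions and not part of the method as described: normal and mutated molecules are taken to amplify with the same efficiency, so the expected error fraction in the final reads is (b + f)/(1 + f) for a background error fraction b, and a binomial likelihood stands in for the distribution fitted from the estimated mean and variance. All names are hypothetical.

```python
import math

def log_binomial_likelihood(k, n, q):
    """Binomial log-likelihood of k error reads out of n, up to a constant in q."""
    return k * math.log(q) + (n - k) * math.log(1.0 - q)

def best_mutant_ratio(error_reads, total_reads, bg_rate, grid):
    """Return the candidate f (mutant-to-normal starting ratio) with the highest likelihood."""
    best_f, best_ll = None, -math.inf
    for f in grid:
        # Expected fraction of reads carrying the variant for this f (assumed model).
        q = (bg_rate + f) / (1.0 + f)
        ll = log_binomial_likelihood(error_reads, total_reads, q)
        if ll > best_ll:
            best_f, best_ll = f, ll
    return best_f

grid = [i * 1e-4 for i in range(101)]  # search space: f from 0 to 0.01
f_hat = best_mutant_ratio(error_reads=120, total_reads=100000, bg_rate=0.0002, grid=grid)
print(f_hat, f_hat / (1.0 + f_hat))  # selected ratio and implied initial mutant fraction
```

The second printed value converts the selected ratio into the fraction of mutated molecules in the initial mixture, f/(1+f), as defined earlier in this example.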

In certain embodiments, Method 2 above is performed as follows:

a) Estimate a PCR efficiency and a per cycle error rate using a training data set;

b) Estimate a number of starting molecules for the testing data set at each base using the distribution of the efficiency estimated in step (a);

c) If needed, update the estimate of the efficiency for the testing data set using the starting number of molecules estimated in step (b);

d) Estimate the mean and variance for the total number of molecules, background error molecules and real mutation molecules (for a search space consisting of an initial percentage of real mutation molecules) using testing set data and parameters estimated in steps (a), (b) and (c);

e) Fit a distribution to the number of total error molecules (background error and real mutation) in the total molecules, and calculate the likelihood for each real mutation percentage in the search space; and

f) Determine the most likely real mutation percentage and calculate the confidence using the data from step (e).

Example 11

This example provides results using the multiplexed PCR CoNVERGe methods provided herein, for the detection of cancer by detecting CNV in circulating DNA. The Plasma CNV Protocol for 3,168-plex for Chromosomes 1p, 1q, 2p, 2q, and 22q11 provided herein was used. Plasma from 21 breast cancer patients (stage I-IIIB) was analyzed. The results shown in FIG. 44 demonstrate that CNVs were detected in all samples using an AAI>=0.45% and required as few as 62 heterozygous SNPs. A similar protocol was used to analyze plasma from ovarian cancer patients. Using a 0.45% cutoff, a 100% ovarian cancer detection rate was achieved, as shown in FIG. 45. Each of the five samples also had a matched tumor sample.

Example 12

This example demonstrates that a dramatic improvement in the ability to detect cancer is achieved by testing plasma for the presence of CNVs and SNVs. CNVs and SNVs were detected using the methods provided in the Examples above. Samples were prepared according to the appropriate protocols in Example 9. SNVs were identified using SNV Method 1 above. As shown in FIG. 46, the sensitivity of detecting breast and lung cancer is greatly improved by analyzing plasma from Stage I-III cancer patients for both CNVs and SNVs versus testing for SNVs alone. Analyzing SNVs only, 71% of cancers were detected in plasma samples. However, by analyzing for the presence of SNVs and/or CNVs, the detection rate goes up to 83% for breast and 92% for lung in the patient populations analyzed. If one considers all of the SNVs and CNVs that have been identified in the TCGA and COSMIC data sets, the expected diagnostic load would be greater than 97% for breast cancer and >98% for lung cancer.

Further analysis was performed on 41 patient samples with different stages of cancer using the plasma sample prep methods provided in Example 9 and SNV Method 1 provided above. As shown in FIG. 47, when assaying for CNVs and SNVs in circulating tumor DNA from breast cancer patients, 60% of Stage I, 88% of Stage II and 100% of Stage III breast cancers were detected using a limit of quantification of 0.2% ctDNA for SNVs and 0.45% ctDNA for CNVs. As shown in FIG. 48, when assaying for CNVs and SNVs in ctDNA and looking at 41 patient samples with different substages of breast cancer, 60% of Stage I, 100% of Stage II, 90% of Stage IIA, 80% of Stage IIB, and 100% of Stage III, IIIA, and IIIB breast cancers were detected using a limit of quantification of 0.2% ctDNA for SNVs and 0.45% ctDNA for CNVs. As shown in FIG. 49, when assaying for CNVs and SNVs in circulating tumor DNA from 24 lung cancer patient samples, 88% of Stage I, 100% of Stage II and 100% of Stage III lung cancers were detected using a limit of quantification of 0.2% ctDNA for SNVs and 0.45% ctDNA for CNVs. As shown in FIG. 50, when assaying for CNVs and SNVs in ctDNA and looking at 24 patient samples with different substages of lung cancer, a 100% detection rate was achieved for all substages except that an 82% detection rate was achieved for the patients with stage IB lung cancer, using a limit of quantification of 0.2% ctDNA for SNVs and 0.45% ctDNA for CNVs.

Example 13

This example demonstrates that detection of SNV in ctDNA overcomes the limitations in identifying variant alleles in biopsied samples due to tumor heterogeneity. TRACERx samples of three small cell lung cancer patient samples and one adenocarcinoma lung cancer patient sample for which tumor biopsies and corresponding pre-operative blood plasma samples had been collected were used for analysis of tumor heterogeneity. Samples were obtained from the Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, London WC1E 6BT, UK. Samples were primary lung cancer samples for analysis of SNV mutations. Two to three biopsies from various regions from the entire cancerous lung were taken from each patient (FIG. 51A). Each biopsied sample was assayed by whole exome sequencing (Illumina HiSeq200; Illumina, San Diego, Calif.), followed by AmpliSeq® sequencing (Ion Torrent, South San Francisco, Calif.) on a PGM® for identification of underlying clonal heterogeneity. Following sequencing and SNV analyses, the variant allele frequency (VAF) was determined for each biopsy sample (FIG. 51B).

Plasma samples from each of the four patients were used to isolate ctDNA and identify both clonal and subclonal SNV mutations in plasma to overcome tumor heterogeneity (FIG. 52). Clonal populations had VAF allele calls in all biopsied samples assayed and in plasma, while subclonal populations had VAF allele calls in at least one biopsy sample, but not all biopsy samples. The plasma was considered to be a cumulative representative of the SNVs found in the ctDNA of each patient. Corresponding PCR assays could not be designed for every SNV identified by sequencing.

To compare the AmpliSeq (Swanton) and mmPCR/NGS assay methods for identifying tumor heterogeneity, Natera designed PCR assays for each SNV mutation for VAF detection in both biopsied and corresponding ctDNA from plasma (FIG. 53). Blank cells represent no biopsy sample available and a zero value represents no VAF detected. The following 11 genes were initially identified as a negative (false VAF call) by the AmpliSeq FP or FN assays but were called correctly by the Natera TP or TN assays and mmPCR/NGS assay methods: L12: CYFIP1, FAT1, MLLT4, and RASA1; L13: HERC4, JAK2, MSH2, MTOR, and PLCG2; L15: GABRG1; L17: TRIM67. Surprisingly, when the AmpliSeq raw sequencing data was re-examined, these results were verified. The raw AmpliSeq data sequencing files revealed that the data fell below the PGM or Illumina detectable threshold setting. The data showed that 16/38 variants were detected in plasma and that there were several biopsy samples in the L12 patient samples that had predominant clonal SNV mutations: L12: BRIP1, CARS, FAT1, MLLT4, NFE2L2, TP53, TP53 as well as patients L13: EGFR, EGFR, TP53 and L15: KDM6A, ROS1. An additional two patients were found to have a total of four subclonal variant mutations in plasma: L12: CIC, KDM6A and L17: NF1, TRIM67. These results are summarized in FIG. 54A, which is a whisker plot of the mean VAF for each sample listed in FIG. 53 by each assay method, and FIG. 54B, which is a direct comparison represented by a linear regression plot of each assay's VAF sample mean.

Example 14

This example demonstrates that by using low primer concentrations such that primer amount is the limiting reactant in multiplex PCR in a workflow that is followed by next generation sequencing, uniformity of density of reads, and therefore limits of detection, across a pool of amplification reactions is improved. Some experiments were carried out for plasma CNV using the 3,168-plex panel according to Example 9 above except that the total reaction volume was 10 uL instead of 20 uL. Furthermore, PCR was carried out for 15, 20, or 25 cycles. Other experiments were carried out using the four 84-plex pools on breast cancer samples according to the protocol of Example 9 except that primer concentrations were 2 nM and PCR amplification was carried out for 15, 20, or 25 cycles.

Not to be limited by theory, it is believed that primer limited multiplex PCR provides improved depth of read uniformity for multiplex PCR before multi-read sequencing, such as sequencing on an Illumina HiSeq or MiSeq system or an Ion Torrent PGM or Proton system, based on the following considerations: If some of the amplifications in a multiplex PCR have lower efficiencies than others, then with normal multiplex PCR we will end up with a wide range of depth of read (“DOR”) values. However, if the amount of primer is limited, and the multiplex PCR is cycled more times than what it takes to exhaust the primers, then the more efficient amplifications will stop doubling (because they have no more primers to use) and the less efficient ones will continue to double; this will result in a more similar amount of amplification product for all of the amplification products. This will translate into a much more uniform distribution of the DOR.
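The reasoning above can be illustrated with a small deterministic sketch. It assumes a simplified model that is not taken from the experiments: each target has its own fixed primer budget, each new copy consumes one primer molecule, and a target with efficiency p gains p times its current copies per cycle until its primers run out. The function name and the efficiencies are hypothetical.

```python
def primer_limited_pcr(start_copies, efficiencies, primers_per_target, n_cycles):
    """Return final copy numbers per target under a per-target primer budget."""
    copies = [float(start_copies)] * len(efficiencies)
    primers = [float(primers_per_target)] * len(efficiencies)
    for _ in range(n_cycles):
        for i, p in enumerate(efficiencies):
            # New copies are capped by the primer molecules still available.
            new = min(p * copies[i], primers[i])
            copies[i] += new
            primers[i] -= new
    return copies

# One efficient target (p = 1.0) and one less efficient target (p = 0.7).
early = primer_limited_pcr(1e5, [1.0, 0.7], 1.2e10, n_cycles=15)
late = primer_limited_pcr(1e5, [1.0, 0.7], 1.2e10, n_cycles=30)
print(early[0] / early[1])  # large imbalance before the primers are exhausted
print(late[0] / late[1])    # both targets plateau at the same amount
```

In this model the ratio between the two products grows with every cycle until the efficient target exhausts its primers, after which the slower target catches up and the amounts converge.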

The following calculations are used to determine the number of cycles that would exhaust a given amount of primer for a given amount of starting nucleic acid template:

-   assume a given starting DNA input level: 100 k copies of each target (10^5; this is easily achieved using amplified library)
-   assume we use 2 nM of each primer as an exemplary concentration, although other concentrations such as, for example, 0.2, 0.5, 1, 1.5, 2, 2.5, 5, or 10 nM could work too.
-   calculate the number of primer molecules for each primer: 2*10^−9 (molar concentration, 2 nM) × 10*10^−6 (reaction volume, 10 ul) × 6*10^23 (number of molecules per mole, Avogadro's number) = 12*10^9
-   calculate the amplification fold needed to consume all primers: 12*10^9 (number of primer molecules) / 10^5 (number of copies of each target) = 12*10^4
-   calculate the number of cycles needed to achieve this amplification fold, assuming 100% efficiency at each cycle: log2(12*10^4) ≈ 17 cycles (this is log2 because at each cycle, the number of copies doubles).

So for these conditions (100 k copies input, 2 nM primers, 10 ul reaction volume, assuming 100% PCR efficiency at each cycle), the primers would be consumed after 17 PCR cycles.

However, the key assumption is that some of the products DO NOT have 100% efficiency, so without measuring their efficiencies (which is only practicable for a small number of them anyway), it would take more than 17 cycles to consume their primers.
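The cycle calculation can be reproduced, and extended to efficiencies below 100%, with the sketch below. The function name and the efficiency parameter are illustrative assumptions; the generalization replaces log2 of the fold with log(fold)/log(1 + efficiency), which reduces to log2 at 100% efficiency. Avogadro's number is taken as 6.022×10^23 rather than the rounded 6×10^23 used in the text.

```python
import math

AVOGADRO = 6.022e23  # molecules per mole

def cycles_to_exhaust(primer_nM, volume_uL, start_copies, efficiency=1.0):
    """Whole number of PCR cycles needed to consume all molecules of one primer."""
    primer_molecules = primer_nM * 1e-9 * volume_uL * 1e-6 * AVOGADRO
    fold = primer_molecules / start_copies
    # Copies grow by a factor (1 + efficiency) per cycle.
    return math.ceil(math.log(fold) / math.log(1.0 + efficiency))

print(cycles_to_exhaust(2, 10, 1e5))                  # 17 at 100% efficiency
print(cycles_to_exhaust(2, 10, 1e5, efficiency=0.7))  # more cycles at 70% efficiency
```

Under these assumptions a product amplifying at 70% efficiency needs 23 cycles rather than 17, which is consistent with cycling beyond the 100%-efficiency estimate.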

FIGS. 55-58 show results for the four 84-plex SNV PCR primer pools. For each of the pools we observed improved DOR efficiency with increasing cycles from 15 to 20 to 25. Similar results were obtained for experiments using the 3,168-plex panel (FIGS. 59-61). The limit of detection decreased (i.e. SNV sensitivity increased) with increasing depth of read. Furthermore, the sensitivity was consistently better when detecting transversion mutations than transition mutations. It is likely that additional increases in DOR efficiency can be obtained with additional cycles when using primer-limiting multiplex PCR before multi-read sequencing.

Accordingly, in one aspect provided herein is a method of amplifying a plurality of target loci in a nucleic acid sample that includes (i) contacting the nucleic acid sample with a library of primers and other primer extension reaction components to provide a reaction mixture, wherein the relative amount of each primer in the reaction mixture compared to the other primer extension reaction components creates a reaction wherein the primers are present at a limiting concentration, and wherein the primers hybridize to a plurality of different target loci; and (ii) subjecting the reaction mixture to primer extension reaction conditions for a sufficient number of cycles to consume or exhaust the primers in the library of primers, to produce amplified products that include target amplicons. For example, the plurality of different target loci can include at least 2, 3, 5, 10, 25, 50, 100, 200, 250, 500, 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different target loci, and at most 50, 100, 200, 250, 500, 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; 100,000; 200,000; 250,000; 500,000; or 1,000,000 different target loci to produce a reaction mixture.

The method, in illustrative embodiments, includes determining an amount of primer that will be a rate limiting amount. This calculation typically includes estimating and/or determining the number of target molecules and involves analyzing and/or determining the number of amplification cycles performed. For example, in illustrative embodiments, the concentration of each primer is less than 100, 75, 50, 25, 10, 5, 2, 1, 0.5, 0.25, 0.2 or 0.1 nM. In various embodiments, the GC content of the primers is between 30 to 80%, such as between 40 to 70% or 50 to 60%, inclusive. In some embodiments, the range of GC content (e.g., the maximum GC content minus minimum GC content, such as 80%−60%=a range of 20%) of the primers is less than 30, 20, 10, or 5%. In some embodiments, the melting temperature (T_(m)) of the primers is between 40 to 80° C., such as 50 to 70° C., 55 to 65° C., or 57 to 60.5° C., inclusive. In some embodiments, the range of melting temperatures of the primers is less than 20, 15, 10, 5, 3, or 1° C. In some embodiments, the length of the primers is between 15 to 100 nucleotides, such as between 15 to 75 nucleotides, 15 to 40 nucleotides, 17 to 35 nucleotides, 18 to 30 nucleotides, or 20 to 65 nucleotides, inclusive. In some embodiments, the primers include a tag that is not target specific, such as a tag that forms an internal loop structure. In some embodiments, the tag is between two DNA binding regions. In various embodiments, the primers include a 5′ region that is specific for a target locus, an internal region that is not specific for the target locus and forms a loop structure, and a 3′ region that is specific for the target locus. In various embodiments, the length of the 3′ region is at least 7 nucleotides. In some embodiments, the length of the 3′ region is between 7 and 20 nucleotides, such as between 7 to 15 nucleotides, or 7 to 10 nucleotides, inclusive. In various embodiments, the test primers include a 5′ region that is not specific for a target locus (such as a tag or a universal primer binding site) followed by a region that is specific for a target locus, an internal region that is not specific for the target locus and forms a loop structure, and a 3′ region that is specific for the target locus. In some embodiments, the range of the length of the primers is less than 50, 40, 30, 20, 10, or 5 nucleotides. In some embodiments, the length of the target amplicons is between 50 and 100 nucleotides, such as between 60 and 80 nucleotides, or 60 to 75 nucleotides, inclusive. In some embodiments, the range of the length of the target amplicons is less than 100, 75, 50, 25, 15, 10, or 5 nucleotides.

In various embodiments of any of the aspects of the invention, the primer extension reaction conditions are polymerase chain reaction conditions (PCR). In various embodiments, the length of the annealing step is greater than 3, 5, 8, 10, or 15 minutes but less than 240, 120, 60, or 30 minutes. In various embodiments, the length of the extension step is greater than 3, 5, 8, 10, or 15 minutes but less than 240, 120, 60 or 30 minutes.

Example 15

This Example demonstrates the ability of the SNV detection methods of the present invention to identify mosaicism in single cell analysis, also referred to as single molecule analysis. FIG. 62 shows multiplex PCR results from tumor cell genomic DNA and single cell/molecule inputs using the 28K-plex primer set according to the 28K single cell method provided in Example 9. Using this method, greater than 85% of reads were mapped, which is over 4.7M reads (about 167 reads per target). The lower portion of the figure shows that mosaicism was observed among cells.

What is claimed is:
 1. A method for preparing a sample of a subject having cancer or suspected of having cancer useful for identifying one or more tumor-specific variants in the blood, plasma, serum, or urine sample, the method comprising: (a) performing whole exome sequencing or whole genome sequencing on nucleic acids derived from a tumor sample of the subject and identifying a plurality of tumor-specific variants; (b) selectively enriching or amplifying a plurality of target loci from cell-free DNA derived from a blood, plasma, serum, or urine sample of the subject to obtain selectively enriched or amplified DNA, wherein 10 to 2,000 of the target loci each encompasses a different tumor-specific variant identified in the tumor sample of the subject; and (c) sequencing the selectively enriched or amplified DNA and obtaining sequence reads with a depth of read of at least 25,000 per target locus, and detecting one or more of the tumor-specific variants present in the cell-free DNA from the sequence reads.
 2. The method of claim 1, wherein step (c) comprises sequencing the selectively enriched or amplified DNA and obtaining sequence reads with a depth of read of at least 50,000 per target locus.
 3. The method of claim 1, wherein the tumor sample of the subject is from a solid tumor.
 4. The method of claim 1, wherein the tumor sample of the subject is a tumor tissue sample.
 5. The method of claim 1, wherein the cell-free DNA comprises circulating tumor DNA.
 6. The method of claim 1, wherein the tumor-specific variants comprise one or more clonal SNV mutations.
 7. The method of claim 1, wherein step (a) further comprises determining clonal heterogeneity of the tumor sample.
 8. The method of claim 1, wherein step (b) comprises selectively enriches or amplifies 20 to 200 target loci each encompassing a different tumor-specific variant identified in the tumor sample of the subject.
 9. The method of claim 1, wherein step (b) comprises selectively enriches or amplifies 200 to 2,000 target loci each encompassing a different tumor-specific variant identified in the tumor sample of the subject.
 10. The method of claim 1, wherein the method further comprises designing PCR primers or hybrid capture probes targeting the tumor-specific variants identified in the tumor sample of the subject, and selectively enriching or amplifying the target loci using the designing PCR primers or hybrid capture probes.
 11. The method of claim 1, wherein the method further comprises performing barcoding PCR prior to step (c).
 12. The method of claim 1, wherein the method further comprises detecting recurrence and/or metastases of the cancer from the tumor-specific variants detected in the cell-free DNA.
 13. The method of claim 1, wherein the subject is a human subject.
 14. The method of claim 1, wherein the cancer is colorectal cancer, lung cancer, bladder cancer, or breast cancer.
 15. The method of claim 1, wherein the method is capable of detecting a tumor-specific variant at a limit of detection of less than or equal to 0.015%.
 16. A method for preparing a sample of a subject having cancer or suspected of having cancer useful for identifying one or more tumor-specific variants in the blood, plasma, serum, or urine sample, the method comprising: (a) performing whole exome sequencing or whole genome sequencing on nucleic acids derived from a tumor sample of the subject and identifying a plurality of tumor-specific variants; (b) selectively enriching or amplifying a plurality of target loci from cell-free DNA derived from a blood, plasma, serum, or urine sample of the subject to obtain selectively enriched or amplified DNA, wherein 10 to 2,000 of the target loci each encompasses a different tumor-specific variant identified in the tumor sample of the subject; and (c) sequencing the selectively enriched or amplified DNA and obtaining sequence reads, and detecting one or more of the tumor-specific variants present in the cell-free DNA from the sequence reads, wherein the method is capable of detecting a tumor-specific variant at a limit of detection of less than or equal to 0.05%.
 17. The method of claim 16, wherein the method is capable of detecting a tumor-specific variant at a limit of detection of less than or equal to 0.015%.
 18. The method of claim 16, wherein the tumor sample of the subject is from a solid tumor.
 19. The method of claim 16, wherein the tumor sample of the subject is a tumor tissue sample.
 20. The method of claim 16, wherein the cell-free DNA comprises circulating tumor DNA.
 21. The method of claim 16, wherein the tumor-specific variants comprise one or more clonal SNV mutations.
 22. The method of claim 16, wherein step (a) further comprises determining clonal heterogeneity of the tumor sample.
 23. The method of claim 16, wherein step (b) comprises selectively enriches or amplifies 20 to 200 target loci each encompassing a different tumor-specific variant identified in the tumor sample of the subject.
 24. The method of claim 16, wherein step (b) comprises selectively enriches or amplifies 200 to 2,000 target loci each encompassing a different tumor-specific variant identified in the tumor sample of the subject.
 25. The method of claim 16, wherein the method further comprises designing PCR primers or hybrid capture probes targeting the tumor-specific variants identified in the tumor sample of the subject, and selectively enriching or amplifying the target loci using the designing PCR primers or hybrid capture probes.
 26. The method of claim 16, wherein the method further comprises performing barcoding PCR prior to step (c).
 27. The method of claim 16, wherein the method further comprises detecting recurrence and/or metastases of the cancer from the tumor-specific variants detected in the cell-free DNA.
 28. The method of claim 16, wherein the subject is a human subject.
 29. The method of claim 16, wherein the cancer is colorectal cancer, lung cancer, bladder cancer, or breast cancer.
 30. The method of claim 16, wherein step (c) comprises sequencing the selectively enriched or amplified DNA and obtaining sequence reads with a depth of read of at least 50,000 per target locus.